CS4230 Parallel Programming, Lecture 19: SIMD and Multimedia Extensions. Mary Hall, November 13, 2012.


Today’s Lecture
- Practical understanding of SIMD in the context of multimedia extensions
- Slide sources: Sam Larsen (PLDI 2000); Jaewook Shin, my former PhD student

SIMD and Multimedia Extensions
- Common feature of most microprocessors
  - Very different architecture from GPUs
- Mostly, you use this feature without realizing it if you use an optimization flag (typically -O2 or above)
- You can improve the result substantially with some awareness of the architectural challenges
- “Wide SIMD” is becoming common in high-end architectures
  - Intel AVX
  - Intel MIC
  - IBM BG/Q

Multimedia Extension Architectures
At the core of multimedia extensions:
- SIMD parallelism
- Variable-sized data fields
- Vector length = register width / type size
- Data in contiguous memory
(Slide source: Jaewook Shin)
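
The "vector length = register width / type size" bullet can be made concrete with a tiny computation. A minimal sketch, assuming a 128-bit register such as an SSE or AltiVec register (the register width is my assumption, not something stated on the slide):

  #include <stdio.h>

  /* Lane count = register width / element size, for an assumed 128-bit register. */
  int main(void) {
      const int register_bits = 128;                              /* assumed SSE/AltiVec width */
      printf(" 8-bit elements: %d lanes\n", register_bits / 8);   /* 16 */
      printf("16-bit elements: %d lanes\n", register_bits / 16);  /*  8 */
      printf("32-bit elements: %d lanes\n", register_bits / 32);  /*  4 */
      return 0;
  }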

Characteristics of Multimedia Applications
- Regular data access pattern
  - Data items are contiguous in memory
- Short data types
  - 8, 16, 32 bits
- Data streaming through a series of processing stages
  - Some temporal reuse for such data streams
- Sometimes:
  - Many constants
  - Short iteration counts
  - Requires saturation arithmetic
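
The saturation arithmetic mentioned in the last bullet can be illustrated with SSE2, which provides saturating adds for the short data types listed above. A minimal sketch; the specific input values are illustrative:

  #include <emmintrin.h>   /* SSE2 */
  #include <stdio.h>

  /* _mm_adds_epu8 clamps 8-bit sums at 255 instead of wrapping around,
   * which is the behavior wanted for pixel data. */
  int main(void) {
      __m128i a = _mm_set1_epi8((char)200);
      __m128i b = _mm_set1_epi8((char)100);
      __m128i wrap = _mm_add_epi8(a, b);     /* 200 + 100 wraps to 44      */
      __m128i sat  = _mm_adds_epu8(a, b);    /* 200 + 100 saturates to 255 */
      unsigned char w[16], s[16];
      _mm_storeu_si128((__m128i *)w, wrap);
      _mm_storeu_si128((__m128i *)s, sat);
      printf("wraparound: %u  saturating: %u\n", w[0], s[0]);
      return 0;
  }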

Programming Multimedia Extensions
- Use compiler or low-level API
  - Programming interface similar to a function call
  - C: built-in functions; Fortran: intrinsics
- Most native compilers support their own multimedia extensions
  - GCC: -faltivec, -msse2
  - AltiVec: dst = vec_add(src1, src2);
  - SSE2: dst = _mm_add_ps(src1, src2);
  - BG/L: dst = __fpadd(src1, src2);
  - No standard!
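
As a concrete illustration of the intrinsic style above, here is a minimal sketch in C using the SSE _mm_add_ps call from the slide; the function name and the use of unaligned loads/stores are my own assumptions:

  #include <xmmintrin.h>   /* SSE */

  /* Add four packed floats with one instruction: dst = _mm_add_ps(src1, src2). */
  void add4(const float *src1, const float *src2, float *dst) {
      __m128 a = _mm_loadu_ps(src1);    /* load 4 floats (unaligned load)  */
      __m128 b = _mm_loadu_ps(src2);
      __m128 c = _mm_add_ps(a, b);      /* one instruction, four additions */
      _mm_storeu_ps(dst, c);            /* store 4 results                 */
  }

With GCC this would be compiled with an option such as -msse (or -msse2), matching the flags listed above.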

1. Independent ALU Ops
Scalar code (each statement multiplies by its own constant):
  R = R + XR * ...
  G = G + XG * ...
  B = B + XB * ...
Packed into one superword operation:
  [R G B] = [R G B] + [XR XG XB] * [... ... ...]
(Slide source: Sam Larsen)

2. Adjacent Memory References
Scalar code:
  R = R + X[i+0]
  G = G + X[i+1]
  B = B + X[i+2]
Packed into one superword operation:
  [R G B] = [R G B] + X[i:i+2]
(Slide source: Sam Larsen)

3. Vectorizable Loops
  for (i=0; i<100; i+=1)
    A[i+0] = A[i+0] + B[i+0]
(Slide source: Sam Larsen)

3. Vectorizable Loops (cont.)
Unrolled by 4:
  for (i=0; i<100; i+=4)
    A[i+0] = A[i+0] + B[i+0]
    A[i+1] = A[i+1] + B[i+1]
    A[i+2] = A[i+2] + B[i+2]
    A[i+3] = A[i+3] + B[i+3]
Vectorized:
  for (i=0; i<100; i+=4)
    A[i:i+3] = A[i:i+3] + B[i:i+3]
(Slide source: Sam Larsen)
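
For reference, a hedged sketch of what the vectorized loop above looks like when written by hand with SSE intrinsics; it assumes float data and, as in the slide, a trip count that is a multiple of 4:

  #include <xmmintrin.h>

  void vec_add(float *A, const float *B, int n) {       /* n == 100 on the slide */
      for (int i = 0; i < n; i += 4) {
          __m128 va = _mm_loadu_ps(&A[i]);
          __m128 vb = _mm_loadu_ps(&B[i]);
          _mm_storeu_ps(&A[i], _mm_add_ps(va, vb));     /* A[i:i+3] = A[i:i+3] + B[i:i+3] */
      }
  }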

4. Partially Vectorizable Loops
  for (i=0; i<16; i+=1)
    L = A[i+0] - B[i+0]
    D = D + abs(L)
(Slide source: Sam Larsen)

4. Partially Vectorizable Loops (cont.)
Unrolled by 2:
  for (i=0; i<16; i+=2)
    L = A[i+0] - B[i+0]
    D = D + abs(L)
    L = A[i+1] - B[i+1]
    D = D + abs(L)
Partially vectorized (the subtraction becomes a superword operation; abs and the accumulation stay scalar):
  for (i=0; i<16; i+=2)
    [L0 L1] = A[i:i+1] - B[i:i+1]
    D = D + abs(L0)
    D = D + abs(L1)
(Slide source: Sam Larsen)
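
A hedged sketch of the same partially vectorizable pattern with SSE intrinsics: the subtraction runs four lanes at a time (the slide uses two), while abs and the accumulation into D stay scalar, so the lanes have to be unpacked. Float data and a trip count that is a multiple of 4 are assumed:

  #include <xmmintrin.h>
  #include <math.h>

  float sum_abs_diff(const float *A, const float *B, int n) {
      float D = 0.0f;
      for (int i = 0; i < n; i += 4) {
          float L[4];
          /* vectorizable part: L[0..3] = A[i:i+3] - B[i:i+3] */
          _mm_storeu_ps(L, _mm_sub_ps(_mm_loadu_ps(&A[i]), _mm_loadu_ps(&B[i])));
          /* non-vectorized part: unpack and accumulate in scalar code */
          for (int k = 0; k < 4; k++)
              D += fabsf(L[k]);
      }
      return D;
  }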

Rest of Lecture
1. Understanding multimedia execution
2. Understanding the overheads
3. Understanding how to write code to deal with the overheads
What are the overheads?
- Packing and unpacking
  - Rearranging data so that it is contiguous
- Alignment overhead
  - Accessing data from the memory system so that it is aligned to a “superword” boundary
- Control flow
  - Control flow may require executing all paths

Packing/Unpacking Costs
Scalar code:
  C = A + 2
  D = B + 3
Packed into one superword operation:
  [C D] = [A B] + [2 3]
(Slide source: Sam Larsen)

Packing/Unpacking Costs (cont.)
Packing source operands: copying into contiguous memory
  A = f()
  B = g()
  C = A + 2
  D = B + 3
A and B must be packed into [A B] before the superword operation [C D] = [A B] + [2 3].

Packing/Unpacking Costs (cont.)
Packing source operands: copying into contiguous memory
Unpacking destination operands: copying back to location
  A = f()
  B = g()
  C = A + 2
  D = B + 3
  E = C / 5
  F = D * 7
A and B must be packed into [A B] before the superword operation [C D] = [A B] + [2 3]; C and D must then be unpacked for the scalar uses E = C / 5 and F = D * 7.
(Slide source: Sam Larsen)
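
An illustrative sketch (not from the slides themselves) of where those copies show up with SSE intrinsics; f() and g() stand for the scalar producers on the slide and are given placeholder bodies here:

  #include <xmmintrin.h>

  static float f(void) { return 1.0f; }   /* placeholder scalar producer */
  static float g(void) { return 2.0f; }   /* placeholder scalar producer */

  void pack_unpack(float *E, float *F) {
      float A = f();
      float B = g();
      __m128 ab = _mm_set_ps(0.0f, 0.0f, B, A);                   /* packing cost        */
      __m128 cd = _mm_add_ps(ab, _mm_set_ps(0.f, 0.f, 3.f, 2.f)); /* [C D] = [A B]+[2 3] */
      float tmp[4];
      _mm_storeu_ps(tmp, cd);                                     /* unpacking cost      */
      float C = tmp[0], D = tmp[1];
      *E = C / 5;                                                 /* scalar consumers    */
      *F = D * 7;
  }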

Alignment Code Generation
Aligned memory access:
- The address is always a multiple of the vector length (16 bytes in this example)
- Just one superword load or store instruction
  float a[64];
  for (i=0; i<64; i+=4)
    Va = a[i:i+3];
    ...
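
A minimal sketch of the aligned case in C with SSE intrinsics; the GCC-style alignment attribute and the reduction body are my assumptions. The point is that each superword access is a single aligned load (_mm_load_ps):

  #include <xmmintrin.h>

  static float a[64] __attribute__((aligned(16)));    /* GCC-style 16-byte alignment */

  float sum_aligned(void) {
      __m128 acc = _mm_setzero_ps();
      for (int i = 0; i < 64; i += 4)
          acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));  /* one aligned superword load */
      float t[4];
      _mm_storeu_ps(t, acc);
      return t[0] + t[1] + t[2] + t[3];
  }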

Alignment Code Generation (cont.)
Misaligned memory access:
- The address is always a non-zero constant offset away from the 16-byte boundaries
- Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge
- Sometimes the hardware does this for you, but it still results in multiple loads
Original:
  float a[64];
  for (i=0; i<60; i+=4)
    Va = a[i+2:i+5];
    ...
After static alignment:
  float a[64];
  for (i=0; i<60; i+=4)
    V1 = a[i:i+3];
    V2 = a[i+4:i+7];
    Va = merge(V1, V2, 8);

Alignment Code Generation (cont.)
Statically align loop iterations:
Original:
  float a[64];
  for (i=0; i<60; i+=4)
    Va = a[i+2:i+5];
After peeling the first elements:
  float a[64];
  Sa2 = a[2]; Sa3 = a[3];
  for (i=4; i<64; i+=4)
    Va = a[i:i+3];

Alignment Code Generation (cont.)
Unaligned memory access:
- The offset from 16-byte boundaries is varying, or not enough information is available
- Dynamic alignment: the merging point is computed at run time
Original:
  float a[64];
  start = read();
  for (i=start; i<60; i++)
    Va = a[i:i+3];
After dynamic alignment:
  float a[64];
  start = read();
  for (i=start; i<60; i++)
    V1 = a[i:i+3];
    V2 = a[i+4:i+7];
    align = (&a[i:i+3]) % 16;
    Va = merge(V1, V2, align);

Summary of Dealing with Alignment Issues
- The worst case is dynamic alignment based on an address calculation (previous slide)
- The compiler (or programmer) can use analysis to prove data is already aligned
  - We know that data is initially aligned at its starting address by convention
  - If we are stepping through a loop with a constant starting point and accessing the data sequentially, then alignment is preserved across the loop
- We can adjust the computation to make it aligned: a sequential portion until aligned, followed by a SIMD portion, possibly followed by a sequential cleanup (sketched below)
- Sometimes the alignment overhead is so significant that there is no performance gain from SIMD execution
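
A hedged sketch of that peel / SIMD / cleanup structure for a simple scaling loop; the function and data are illustrative, not from the lecture:

  #include <xmmintrin.h>
  #include <stdint.h>

  void scale(float *x, float s, int n) {
      int i = 0;
      /* sequential prologue: advance until &x[i] is 16-byte aligned */
      while (i < n && ((uintptr_t)&x[i] % 16) != 0) {
          x[i] *= s;
          i++;
      }
      /* aligned SIMD portion */
      __m128 vs = _mm_set1_ps(s);
      for (; i + 4 <= n; i += 4)
          _mm_store_ps(&x[i], _mm_mul_ps(_mm_load_ps(&x[i]), vs));
      /* sequential cleanup for the remainder */
      for (; i < n; i++)
          x[i] *= s;
  }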

Last SIMD Issue: Control Flow
What if I have control flow? Both control-flow paths must be executed!
Example 1:
  for (i=0; i<16; i++)
    if (a[i] != 0)
      b[i]++;
What happens:
- Compute a[i] != 0 for all fields
- Compute b[i]++ for all fields into temporary t1
- Copy b[i] into another register t2
- Merge t1 and t2 according to the value of a[i] != 0
Example 2:
  for (i=0; i<16; i++)
    if (a[i] != 0)
      b[i] = b[i] / a[i];
    else
      b[i]++;
What happens:
- Compute a[i] != 0 for all fields
- Compute b[i] = b[i] / a[i] in register t1
- Compute b[i]++ for all fields in t2
- Merge t1 and t2 according to the value of a[i] != 0
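
A hedged sketch of the first example above with SSE2 integer intrinsics: both "paths" are computed as data and merged with a mask, as described. 32-bit integer arrays and a trip count that is a multiple of 4 are my assumptions:

  #include <emmintrin.h>   /* SSE2 */

  void guarded_increment(const int *a, int *b, int n) {
      const __m128i zero = _mm_setzero_si128();
      const __m128i one  = _mm_set1_epi32(1);
      for (int i = 0; i < n; i += 4) {
          __m128i va  = _mm_loadu_si128((const __m128i *)&a[i]);
          __m128i vb  = _mm_loadu_si128((const __m128i *)&b[i]);
          __m128i eqz = _mm_cmpeq_epi32(va, zero);     /* all-ones where a[i] == 0  */
          __m128i inc = _mm_andnot_si128(eqz, one);    /* 1 where a[i] != 0, else 0 */
          _mm_storeu_si128((__m128i *)&b[i], _mm_add_epi32(vb, inc));
      }
  }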

Reasons Why the Compiler Fails to Parallelize Code
- A dependence interferes with parallelization
  - Sometimes an assertion of no aliasing will solve this
  - Sometimes an assertion of no dependence will solve this
- The compiler anticipates that too much overhead will eliminate the gain
  - Costly alignment, or alignment that cannot be proven
  - The number of loop iterations is too small
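
A small sketch of the "assertion of no aliasing" point: the C99 restrict qualifier tells the compiler that the three arrays never overlap, removing the possible dependence that would otherwise block vectorization (exact flags vary by compiler):

  /* With restrict, gcc -O2 -ftree-vectorize (or -O3) can vectorize this loop. */
  void add(float *restrict a, const float *restrict b,
           const float *restrict c, int n) {
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }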