EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism University of Michigan December 10, 2012.


EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism University of Michigan December 10, 2012

- 2 - Announcements
v Last class today!
  » No more reading
v Dec – Project presentations
  » Each group sign up for 30-minute slot
  » See me after class if you have not signed up
v Course evaluations reminder
  » Please fill one out, it will only take 5 minutes
  » I do read them
  » Improve the experience for future 583 students

- 3 - Notes on Project Demos
v Demo format
  » Each group gets 30 minutes
     Strict deadlines enforced because many back-to-back groups
     Don't be late!
     Figure out your room number ahead of time (see schedule on my door)
  » Plan for 20 mins of presentation (no more!), 10 mins questions
     Some slides are helpful, try to have all group members say something
     Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results
     Demo or real code examples are good
v Report
  » 5 pg double spaced including figures – what you did + why, implementation, and results
  » Due either when you do your demo or Dec 18 at 6pm

- 4 - SIMD Processors: Larrabee (now called Knights Corner) Block Diagram

- 5 - Vector Unit Block Diagram

- 6 - Processor Core Block Diagram

- 7 - Larrabee vs Conventional GPUs
v Each Larrabee core is a complete Intel processor
  » Context switching & pre-emptive multi-tasking
  » Virtual memory and page swapping, even in texture logic
  » Fully coherent caches at all levels of the hierarchy
v Efficient inter-block communication
  » Ring bus for full inter-processor communication
  » Low latency, high bandwidth L1 and L2 caches
  » Fast synchronization between cores and caches
v Larrabee: the programmability of IA with the parallelism of graphics processors

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Multimedia Extensions
Additions to all major ISAs
SIMD operations

Using Multimedia Extensions
Library calls and inline assembly
– Difficult to program
– Not portable
Different extensions to the same ISA
– MMX and SSE
– SSE vs. 3DNow!
Need automatic compilation
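To see why hand coding is painful, here is a minimal sketch of the a[i] = a[i] + b[i] loop written directly against Intel's SSE intrinsics (the function name and the assumption that n is a multiple of 4 are ours; retargeting this to 3DNow! or AltiVec would rewrite every line):

#include <xmmintrin.h>   /* Intel SSE intrinsics */

/* Hand-vectorized a[i] += b[i]; a sketch assuming n % 4 == 0. */
void add_arrays_sse(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats of a */
        __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 floats of b */
        _mm_storeu_ps(&a[i], _mm_add_ps(va, vb)); /* 4 adds in one op   */
    }
}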

Vector Compilation
Pros:
– Successful for vector computers
– Large body of research
Cons:
– Involved transformations
– Targets loop nests

Superword Level Parallelism (SLP)
Small amount of parallelism
– Typically 2 to 8-way
Exists within basic blocks
Uncovered with a simple analysis
Independent isomorphic operations
– New paradigm

1. Independent ALU Ops
R = R + XR * c1
G = G + XG * c2
B = B + XB * c3
becomes a single SIMD multiply-add on packed operands:
[R G B] = [R G B] + [XR XG XB] * [c1 c2 c3]

2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
becomes one wide load and one SIMD add:
[R G B] = [R G B] + X[i:i+2]

3. Vectorizable Loops
for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

Unrolled four times:
for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
then packed into a single SIMD statement:
for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]

4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)

4. Partially Vectorizable Loops
Unrolled twice:
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
then partially packed (the subtraction vectorizes, the reduction stays scalar):
for (i=0; i<16; i+=2)
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
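As a concrete sketch in SSE intrinsics (4-wide rather than the slide's 2-wide; float data and the function name are our assumptions), the subtraction runs as one SIMD op while the abs/accumulate into D remains scalar:

#include <math.h>
#include <xmmintrin.h>

/* Partially vectorized loop: SIMD subtract, scalar abs-reduction. */
float partial_vectorized(const float A[16], const float B[16]) {
    float D = 0.0f;
    for (int i = 0; i < 16; i += 4) {
        __m128 L = _mm_sub_ps(_mm_loadu_ps(&A[i]), _mm_loadu_ps(&B[i]));
        float l[4];
        _mm_storeu_ps(l, L);          /* unpack the four lanes   */
        for (int j = 0; j < 4; j++)
            D = D + fabsf(l[j]);      /* reduction stays scalar  */
    }
    return D;
}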

Exploiting SLP with SIMD Execution
Benefit:
– Multiple ALU ops → One SIMD op
– Multiple ld/st ops → One wide mem op
Cost:
– Packing and unpacking
– Reshuffling within a register

Packing/Unpacking Costs
C = A + 2
D = B + 3
as one SIMD op: [C D] = [A B] + [2 3]

Packing/Unpacking Costs
Packing source operands: A and B are produced in scalar registers and must first be gathered into [A B]
A = f()
B = g()
C = A + 2
D = B + 3
[C D] = [A B] + [2 3]

Packing/Unpacking Costs
Packing source operands, and unpacking destination operands: the scalar uses of C and D force them back out of the vector register
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
[C D] = [A B] + [2 3]
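In SSE intrinsics the overhead is visible; a minimal sketch (float data and the function name are assumed, with f() and g() replaced by scalar parameters):

#include <xmmintrin.h>

/* C = A + 2; D = B + 3; E = C / 5; F = D * 7 with explicit pack/unpack. */
void pack_unpack_demo(float A, float B, float *E, float *F) {
    __m128 AB = _mm_set_ps(0.0f, 0.0f, B, A);           /* pack [A B]   */
    __m128 CD = _mm_add_ps(AB, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));
    float C = _mm_cvtss_f32(CD);                        /* unpack C     */
    float D = _mm_cvtss_f32(_mm_shuffle_ps(CD, CD, 1)); /* unpack D     */
    *E = C / 5.0f;                                      /* scalar uses  */
    *F = D * 7.0f;
}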

Optimizing Program Performance
To achieve the best speedup:
– Maximize parallelization
– Minimize packing/unpacking
Many packing possibilities
– Worst case: n ops → n! configurations
– Different cost/benefit for each choice

Observation 1: Packing Costs can be Amortized
Use packed result operands: the packed result [A D] of the first SIMD add feeds the SIMD subtract directly
A = B + C
D = E + F
G = A - H
I = D - J

Observation 1: Packing Costs can be Amortized
Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
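A sketch of the sharing case in SSE intrinsics (float data, the names, and the output layout are our assumptions): [B E] is packed once and feeds both SIMD adds, so its packing cost is amortized across two operations:

#include <xmmintrin.h>

/* A = B + C; D = E + F; G = B + H; I = E + J with a shared packed source. */
void shared_packed_source(float B, float C, float E, float F,
                          float H, float J, float out[4]) {
    __m128 BE = _mm_set_ps(0.0f, 0.0f, E, B);  /* packed once, used twice */
    __m128 AD = _mm_add_ps(BE, _mm_set_ps(0.0f, 0.0f, F, C));
    __m128 GI = _mm_add_ps(BE, _mm_set_ps(0.0f, 0.0f, J, H));
    out[0] = _mm_cvtss_f32(AD);                         /* A */
    out[1] = _mm_cvtss_f32(_mm_shuffle_ps(AD, AD, 1));  /* D */
    out[2] = _mm_cvtss_f32(GI);                         /* G */
    out[3] = _mm_cvtss_f32(_mm_shuffle_ps(GI, GI, 1));  /* I */
}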

Observation 2: Adjacent Memory is Key
Large potential performance gains
– Eliminate load/store instructions
– Reduce memory bandwidth
Few packing possibilities
– Only one ordering exploits pre-packing

SLP Extraction Algorithm
Running example:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

Step 1: Identify adjacent memory references
[A B] = X[i:i+1]

Step 2: Follow def-use chains
[H J] = [C D] - [A B]

Step 3: Follow use-def chains
[C D] = [E F] * [3 5]
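Putting the three packed statements together, here is a sketch of the code the extraction yields, written in SSE intrinsics (float data assumed; treating E, F as scalar inputs and H, J as outputs is our framing):

#include <xmmintrin.h>

/* Packed form of the running example. */
void slp_extracted(const float *X, int i, float E, float F,
                   float *H, float *J) {
    /* [A B] = X[i:i+1]: one wide load covers both adjacent references
       (assumes X[i..i+3] is readable; only lanes 0-1 are used) */
    __m128 AB = _mm_loadu_ps(&X[i]);
    /* [C D] = [E F] * [3 5] */
    __m128 CD = _mm_mul_ps(_mm_set_ps(0.0f, 0.0f, F, E),
                           _mm_set_ps(0.0f, 0.0f, 5.0f, 3.0f));
    /* [H J] = [C D] - [A B] */
    __m128 HJ = _mm_sub_ps(CD, AB);
    *H = _mm_cvtss_f32(HJ);                         /* unpack lane 0 */
    *J = _mm_cvtss_f32(_mm_shuffle_ps(HJ, HJ, 1));  /* unpack lane 1 */
}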

SLP Availability

SLP vs. Vector Parallelism

Conclusions
Multimedia architectures abundant
– Need automatic compilation
SLP is the right paradigm
– 20% non-vectorizable in SPEC95fp
SLP extraction successful
– Simple, local analysis
– Provides speedups from 1.24 to 6.70
Found SLP in general-purpose codes