Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG
Venkatraman Govindaraju, Tony Nowatzki, Karthikeyan Sankaralingam

Presentation transcript:

Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG
Venkatraman Govindaraju, Tony Nowatzki, Karthikeyan Sankaralingam
University of Wisconsin-Madison

Motivation
 SIMD (Single Instruction, Multiple Data)
   Exploits data-level parallelism
   Great for performance and energy
   Examples: x86's SSE/AVX, ARM's NEON
 Programming for SIMD
   Assembly or compiler intrinsics
   Preferred: auto-vectorization
   Compilers fail to auto-vectorize in most cases*
How can we broaden the scope of vectorization?
* S. Maleki et al. An Evaluation of Vectorizing Compilers. In PACT 2011.

An Example

    // Needleman-Wunsch
    int a[NCOLS][NROWS], b[NCOLS][NROWS]; // initialized elsewhere
    for (int i = 1; i < NCOLS; ++i) {
      for (int j = 1; j < NROWS; ++j) {
        a[i][j] = max(a[i-1][j-1] + b[i][j],
                      max(a[i-1][j], a[i][j-1]));
      }
    }

The outer iterations are dependent, too: each a[i][j] uses the result of the previous iteration, forming a dependence chain through the array a[][]. Compilers do not map this code to the SIMD unit.

SIMD Shackles: Its Interface to the Compiler
[Figure: SIMD unit attached to the memory/vector register file]
 Fixed memory interface
 Fixed parallel datapath
 No control flow


Our Solution: Flexible Microarchitecture (DySER)
[Figure: DySER attached to the memory/vector register file]
 Flexible memory interface
 Configurable datapath
 Control-flow support
 Expose a flexible microarchitecture to the compiler
 Capture the complexity with the AEPDG, a variant of the PDG
 Design and implement a compiler with the AEPDG

Executive Summary
 Identify the microarchitecture techniques required to tackle the challenges of vectorization
 Develop the AEPDG, a variant of the program dependence graph, to model the flexible microarchitecture
 Design and implement compiler optimizations with the AEPDG
 Demonstrate how our solution broadens the scope of SIMD
 Evaluate on throughput benchmarks: our solution outperforms ICC with SSE/AVX by 1.8x

Outline
 Introduction
 Flexible Architecture: DySER
 Access Execute PDG (AEPDG)
 Compiler Design and Optimizations
 Evaluation
 Conclusion

DySER Overview
 Circuit-switched array of functional units
 Integrated into the processor pipeline
 Dynamically creates specialized datapaths
[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with I$ and D$; DySER sits alongside the execution units, fed from the register file]
[HPCA 2011]

Flexible Microarchitecture: Configurable Datapath
 Configure switches and functional units to create different datapaths
 Can specialize the datapath for ILP or for DLP
 Allows the compiler to use DySER to accelerate a variety of computation patterns

Flexible Microarchitecture: Control Flow Mapping
[Figure: switch fabric mapping an if-else region onto compare (>), add (+), subtract (-), and select (φ) units]
 Predication
   Predicates the output
   A meta-bit in the datapath propagates the validity of the data
 "Select" functional unit (PHI function)
   Selects the valid input and forwards it as its output
Native control-flow mapping allows the compiler to use DySER to accelerate code with arbitrary control flow.
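To make the select mapping concrete, here is a minimal C++ sketch of the if-conversion implied above, using the if/else loop that reappears in the AEPDG example later in the talk. This is an illustrative rendering of the transformation, not DySER's actual configuration format.

    #include <cstddef>

    // Original control flow:
    //   if (b[i] < 0) a = b[i] + 5; else a = b[i] - 5;
    // After if-conversion, both sides execute and a select (PHI)
    // keeps the result whose predicate is valid.
    void predicated(int* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            bool pred    = b[i] < 0;        // maps to the compare unit
            int  taken   = b[i] + 5;        // maps to the '+' unit
            int  fallthr = b[i] - 5;        // maps to the '-' unit
            b[i] = pred ? taken : fallthr;  // maps to the select (PHI) unit
        }
    }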

Flexible Microarchitecture: Decoupled Access/Execute Model
 Memory access instructions execute in the processor pipeline
   Address calculation, loads, and stores
   Configure DySER
   Send data to DySER
   Receive data from DySER
 Computation executes in DySER
[Figure: the processor issues a configuration and streams of sends/receives to the DySER array]

Flexible Microarchitecture: Decoupled Access/Execute Model (cont.)
[Figure: input FIFOs feeding input ports IP0-IP3 and output FIFOs draining output ports OP0-OP3 of the DySER array]
 The processor sends data to DySER through its input FIFOs (input ports)
 DySER computes in dataflow fashion
 The processor receives data from DySER through its output FIFOs (output ports)
The decoupled access/execute model allows DySER to consume data in a different order than the order in which it is stored.
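A hedged sketch of what the access code running on the core might look like; the dyser_* intrinsic names and signatures below are hypothetical stand-ins for the actual ISA extensions.

    // Hypothetical intrinsics standing in for DySER's ISA extensions;
    // the real instruction names and signatures may differ.
    extern void dyser_config(const void* cfg);    // load a datapath configuration
    extern void dyser_send(int port, int value);  // push a value into an input FIFO
    extern int  dyser_recv(int port);             // pop a result from an output FIFO

    // Access code: the core handles addressing, loads, and stores,
    // while the configured DySER array performs the computation.
    void access_loop(const int* b, int* out, int n, const void* cfg) {
        dyser_config(cfg);                    // one-time datapath setup
        for (int i = 0; i < n; ++i) {
            dyser_send(/*port=*/0, b[i]);     // load feeds input port IP0
            out[i] = dyser_recv(/*port=*/0);  // store drains output port OP0
        }
    }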

Flexible Microarchitecture: Flexible Vector Interface
[Figure: a memory/vector register routed to input ports IP0-IP3 through a "vector port mapping"]
The flexible vector interface allows the compiler to use DySER on code with different memory access patterns (e.g., strided).

Outline
 Introduction
 Flexible Architecture: DySER
 Access Execute PDG (AEPDG)
 Compiler Design and Optimizations
 Evaluation
 Conclusion

Access Execute Program Dependence Graph (AEPDG)
 A variant of the PDG
   Edges represent both data and control dependences
 Explicitly partitioned into access-PDG and execute-PDG subgraphs
 Edges between the access- and execute-PDG are augmented with temporal information

    for (i = 0; i < N; ++i) {
      if (b[i] < 0)
        a = b[i] + 5;
      else
        a = b[i] - 5;
      b[i] = a;
    }

[Figure: the loop's AEPDG; the address computation (b+i), the load, and the store form the access-PDG, while the compare (<), the +5 and -5 operations, and the select (φ) form the execute-PDG]
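The talk does not prescribe a concrete representation; the following is a minimal C++ sketch of how an AEPDG might be encoded, with the temporal ordering on access/execute edges carried as an explicit index. All names here are illustrative.

    #include <string>
    #include <vector>

    // Illustrative AEPDG encoding (not the paper's actual data
    // structure). Nodes are partitioned into the access-PDG (runs on
    // the core) and the execute-PDG (runs on DySER).
    struct Node {
        enum class Part { Access, Execute } part;
        std::string op;        // e.g. "load", "store", "+", "phi"
    };

    struct Edge {
        int src, dst;          // node indices
        bool isControl;        // data vs. control dependence
        int temporalIndex;     // ordering on access<->execute edges
                               // (which FIFO slot the value occupies);
                               // -1 for edges inside one partition
    };

    struct AEPDG {
        std::vector<Node> nodes;
        std::vector<Edge> edges;
    };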

AEPDG and the Flexible Interface
[Figure: an AEPDG whose execute region multiplies a[i], a[i+1], and a[i+2] into out[i]; unrolling yields a second copy consuming a[i+3..i+5] and producing out[i+1]. Vector map generation (load/store coalescing) assigns the loads to vector ports P0-P3.]
Each edge on the interface knows its order.

AEPDG and the Flexible Interface (cont.)
[Figure: after coalescing, the six scalar loads become a single vector access a[i:i+5], distributed to ports P0-P3 through the vector port map]
Each edge on the interface knows its order.
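One concrete reading of the coalescing step, sketched in C++: six scalar sends are replaced by one contiguous vector access whose lanes are routed to DySER input ports according to a port map. The dyser_send_vec intrinsic and the port-map encoding are hypothetical.

    // Hypothetical wide-send intrinsic: ships 'count' contiguous words
    // to DySER, distributing lanes to input ports per 'portMap'.
    extern void dyser_send_vec(const int* addr, int count, const int* portMap);

    void coalesced(const int* a, int i) {
        // Before coalescing: six scalar sends, one per interface edge.
        //   dyser_send(P0, a[i]); dyser_send(P1, a[i+1]); ...
        // After coalescing: one vector access a[i:i+5]; the port map
        // records which lane feeds which port, preserving edge order.
        static const int portMap[6] = {0, 1, 2, 0, 1, 2}; // two region copies
        dyser_send_vec(&a[i], 6, portMap);
    }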

Outline
 Introduction
 Flexible Architecture: DySER
 Access Execute PDG (AEPDG)
 Compiler Design and Optimizations
 Evaluation
 Conclusion

Compilation Tasks
 Identify loops to specialize
 Construct the AEPDG
   Access-PDG
   Execute-PDG
 Perform optimizations
   Vectorization, if possible
 Schedule
   Execute-PDG to DySER
   Access-PDG to the core
[Figure: compiler flow from the application through region identification and partitioning to optimization, emitting access code for the core and a schedule of the execute code for DySER]
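Read as code, that flow might look like the outline below; every name here is a hypothetical stand-in for the compiler's actual passes, not its real API.

    // Hypothetical pass outline mirroring the figure's flow.
    struct Loop {};    // a candidate loop
    struct Region {};  // a specializable region
    struct AEPDG {};   // the partitioned dependence graph

    extern Region identify_region(Loop&);         // loops worth specializing
    extern AEPDG  build_aepdg(const Region&);     // access/execute partition
    extern void   vectorize_if_possible(AEPDG&);  // optimization step
    extern void   schedule_execute_pdg(AEPDG&);   // execute-PDG -> DySER config
    extern void   emit_access_code(AEPDG&);       // access-PDG -> core code

    void compile_region(Loop& loop) {
        Region r = identify_region(loop);
        AEPDG  g = build_aepdg(r);
        vectorize_if_possible(g);
        schedule_execute_pdg(g);
        emit_access_code(g);
    }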

Small Loops
 We can again leverage loop properties
 Simply unroll the loop further, "cloning" the region
 Also uses the flexible I/O
[Figure: before cloning, one copy of the execute region reads a[0..3] and b[0..3] from the input FIFO and writes c[0..3] to the output FIFO; after cloning, two copies of the region process the even and odd elements in parallel]
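A scalar-level reading of the cloning transformation, assuming for illustration that the region computes one multiply per iteration (the exact operation in the slide's figure is partly garbled): after unrolling, two independent copies of the region can be mapped side by side.

    // Before cloning: one region instance per iteration.
    void small_loop(const int* a, const int* b, int* c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];             // one copy of the region
    }

    // After cloning: unroll by two; each copy of the region handles
    // one of the two independent element streams.
    void small_loop_cloned(const int* a, const int* b, int* c, int n) {
        for (int i = 0; i + 1 < n; i += 2) {
            c[i]     = a[i]     * b[i];     // region copy 0
            c[i + 1] = a[i + 1] * b[i + 1]; // region copy 1
        }
    }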

Large Loops: Subgraph Matching
 Subgraph matching: find identical computations and split them out
 Region splitting: configure multiple regions and switch between them quickly
[Figure: a large region reduced by subgraph matching, then virtualized across multiple configurations (region virtualization)]

Loop Dependence

    // Needleman-Wunsch
    int a[NCOLS][NROWS], b[NCOLS][NROWS]; // initialized elsewhere
    for (int i = 1; i < NCOLS; ++i) {
      for (int j = 1; j < NROWS; ++j) {
        a[i][j] = max(a[i-1][j-1] + b[i][j],
                      max(a[i-1][j], a[i][j-1]));
      }
    }

The outer iterations are dependent, too: each a[i][j] uses the result of the previous iteration (a[i][j-1]) along a dependence chain through the array. By keeping that chain of +/max operations inside DySER and streaming in the independent operands (a[i-1][j-1], a[i-1][j], b[i][j]), the loop becomes vectorizable!
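One way to read the transformation in scalar code (a hedged sketch; the compiled code drives DySER ports rather than a scalar variable): the loop-carried value is kept live across iterations, so the surrounding loads become independent, vectorizable streams.

    #include <algorithm>

    // Sketch: carry the dependence (a[i][j-1]) in a register while the
    // remaining operands stream in independently. On DySER the chained
    // +/max units hold this carried value inside the array, so the
    // accesses to a[i-1][*] and b[i][*] can be issued as vectors.
    void nw_row(const int* prev /* a[i-1] */, const int* bi /* b[i] */,
                int* cur /* a[i] */, int nrows) {
        int carried = cur[0];                  // a[i][0], boundary value
        for (int j = 1; j < nrows; ++j) {
            carried = std::max(prev[j-1] + bi[j],
                               std::max(prev[j], carried));
            cur[j] = carried;                  // feeds iteration j+1
        }
    }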

Outline
 Introduction
 Flexible Architecture: DySER
 Access Execute PDG (AEPDG)
 Compiler Design and Optimizations
 Evaluation
 Conclusion

Evaluation Methodology
 Implementation
   Leverages the LLVM compilation framework
   Constructs the AEPDG from LLVM IR
   Generates binaries for x86 and SPARC
 Benchmarks
   Throughput kernels
   Parboil benchmarks
 Simulation framework
   gem5 + DySER
   Evaluated DySER with a 4-wide out-of-order core
   Compared the ICC auto-vectorizer to the DySER compiler

SSE/AVX vs. DySER
[Figure: speedups of SSE/AVX and DySER-compiled code across the benchmarks; one kernel reaches 13x, and on some kernels DySER is bottlenecked by its FDIV/FSQRT units]
 DySER performs on average 1.8x better than SSE/AVX
 When DLP is readily available, both SIMD and DySER perform well
 On control-intensive code, DySER performs better

Programmer-Optimized vs. Compiler-Optimized
[Figure: compiler-generated code vs. hand-optimized code; the remaining gaps come from outer-loop transformations, a different reduction strategy, and constant-table lookups]
The compiler-generated code's slowdown is only 30%.

Conclusion
 Broke the "SIMD shackles" with a flexible microarchitecture
 Managed the complexity of the microarchitecture interface with the AEPDG
 Compilers built using the AEPDG can broaden the scope of SIMD
 We must rethink accelerators and their interfaces to compilers
   Incremental changes to accelerators lead to diminishing returns

Questions?
DySER compiler available at