Compiling for Vector-Thread Architectures
Mark Hampton and Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory / University of California at Berkeley
April 9, 2008

Vector-thread (VT) architectures efficiently encode parallelism in a variety of applications
- A VT architecture unifies the vector and multithreaded execution models
- The Scale VT architecture exploits DLP, TLP, and ILP (with clustering) simultaneously
- Previous work [Krashinsky04] has shown the ability of Scale to take advantage of the parallelism available in several different types of loops
  - However, that evaluation relied on mapping code to Scale using handwritten assembly

This work presents a back end code generator for the Scale architecture
- The compiler infrastructure is relatively immature, as much of the work to this point consisted of getting all the pieces to run together
- We prioritized taking advantage of Scale's unique features to enable support for "difficult" types of loops rather than focusing on optimizations
  - The compiler can parallelize loops with internal control flow, outer loops, and loops with cross-iteration dependences
  - However, the compiler does not currently handle while loops
- Despite the lack of optimizations, the compiler is still able to produce some significant speedups

Talk outline
- Vector-thread architecture background
- Compiler overview
  - Emphasis on how code is mapped to Scale
- Performance evaluation
- Conclusions

Vector-thread architectures use a virtual processor (VP) abstraction
- VPs contain registers and ALUs
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- AIBs must be explicitly fetched
  - An AIB can be fetched for a group of VPs (vector-fetch) or for a single VP (thread-fetch)
- Fetches can be predicated to allow conditional branching
- A VP stops after it executes an AIB that does not issue a fetch instruction

[Figure: a VP, containing registers and ALUs, executes a vector-fetched AIB followed by thread-fetched AIBs; fetch instructions, optionally predicated, chain the AIBs together.]

A control processor interacts with a vector of virtual processors

[Figure: the control processor issues vector-fetches to VP0 through VPN and vector-load/vector-store commands to a vector memory unit; each VP contains registers and ALUs, can issue its own thread-fetches, and is linked to its neighbors by the cross-VP queue.]

The Scale processor prototype implements the vector-thread architectural paradigm
- Scale is a high-performance, energy-efficient embedded design [Krashinsky07]

Scale excels at exploiting loop-level parallelism
- The typical programming model is to have the control processor launch a group of VPs, with each VP executing a single iteration (see the sketch below)
- The ability of VPs to direct their own control flow and to use the cross-VP network enables support for a wider variety of loop types than traditional vector designs
- The ability to support vector execution in data-parallel code sections enables a higher degree of performance and energy efficiency than a traditional multithreaded design
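
To make that mapping concrete, here is a minimal C sketch of the model, not actual Scale code; the names vt_map and iteration are hypothetical, and vlen stands for the configured vector length.

    /* Minimal sketch of the VT loop-mapping model: the control processor
       hands out strips of iterations, and each VP runs one iteration.
       vt_map and iteration are hypothetical names for illustration. */
    void vt_map(int n, int vlen, void (*iteration)(int i))
    {
        for (int base = 0; base < n; base += vlen)            /* control processor */
            for (int vp = 0; vp < vlen && base + vp < n; vp++)
                iteration(base + vp);                         /* runs on VP number vp */
    }

On Scale the inner loop is not sequential: a vector-fetch launches all the VPs in the strip at once, and they execute concurrently on the VTU.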

The compiler for Scale ties together three existing infrastructures
- SUIF: front end for the C source code, memory dependence analysis, and SUIF-to-Trimaran conversion
- Trimaran: classical optimizations, scalar-to-VP code transformation, cluster assignment, prepass instruction scheduling, register allocation, postpass instruction scheduling, AIB formation, chain register insertion, and assembly code generation
- GCC: cross compilation of the generated assembly into a binary executable

The compiler conducts a dependence analysis to select which loop to parallelize
- SUIF's dependence library is used to annotate memory operations with direction vectors
- The restrict keyword is required to indicate there is no aliasing (see the example below)
  - This is the extent of manual programmer intervention
- Trimaran uses the results of the SUIF analysis to detect whether a particular loop in a nest has any cross-iteration dependences
  - Priority is given to parallelizing innermost DOALL loops
  - If a loop nest contains no DOALL loops, the compiler tries to parallelize a DOACROSS loop
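
As an illustration of the required annotation, the DOALL loop from the later slide could be wrapped as follows; the function name vmac and the COEFF value are placeholders chosen here, not part of the original code.

    /* The restrict qualifiers promise that out, in1, and in2 never alias,
       which lets the dependence analysis prove the loop is DOALL. */
    #define COEFF 3   /* placeholder constant for illustration */

    void vmac(int len, int * restrict out,
              const int * restrict in1, const int * restrict in2)
    {
        for (int i = 0; i < len; i++)
            out[i] = COEFF * in1[i] + in2[i];
    }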

Once a loop is selected, it is mapped to the VTU without any restructuring
- Header block (at the loop entry): vector-fetched code, VTU commands, scalar instructions
- Internal loop blocks: thread-fetched code
- Back edge/exit block (at the loop exit): vector-fetched code, VTU commands, scalar instructions
- Any established front end loop transformation can also be used, but that doesn't change the back end code generation strategy

Simple DOALL loops are handled similarly to traditional vectorization

    for (i = 0; i < len; i++)
        out[i] = COEFF*in1[i] + in2[i];

Scalar code:

        li r0, COEFF
    loop:
        lw r1, in1
        mult r2, r0, r1
        lw r3, in2
        add r4, r2, r3
        sw r4, out
        add in1, 4
        add in2, 4
        add out, 4
        sub len, 1
        bnez len, loop

Compiler tasks:
- Add a command to configure the VTU
- Strip mine the loop
- Map scalar code to VTU code
- Propagate loop-invariant values to shared registers

Simple DOALL loops are handled similarly to traditional vectorization (continued): the scalar loop above becomes the following VTU code

        vcfgvl r5, 128, ...
        vwrsh s0, COEFF
    loop:
        setvl r6, len
        vlw v0, in1
        vmult v1, v0, s0
        vlw v2, in2
        vadd sd0, v1, v2
        vsw sd0, out
        sll r7, r6, 2
        add in1, r7
        add in2, r7
        add out, r7
        sub len, r6
        bnez len, loop
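
The strip mining that vcfgvl/setvl express can be pictured in plain C; this is only a sketch of the iteration structure, with VLMAX standing in for the maximum vector length (128 in the slide) and doall_stripmined a hypothetical name.

    /* C-level picture of strip mining: each outer pass handles up to VLMAX
       iterations, one per VP; setvl clamps the strip to what remains. */
    enum { VLMAX = 128 };               /* configured by vcfgvl */

    void doall_stripmined(int len, int *out, const int *in1,
                          const int *in2, int coeff)
    {
        for (int i = 0; i < len; ) {
            int vl = (len - i < VLMAX) ? (len - i) : VLMAX;   /* setvl */
            for (int k = 0; k < vl; k++)       /* conceptually, vl VPs at once */
                out[i + k] = coeff * in1[i + k] + in2[i + k];
            i += vl;                           /* analogue of sub len, r6 */
        }
    }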

Internal control flow can be handled by allowing VPs to fetch their own code

    for (i = 0; i < len; i++) {
        if (in[i] < 4)
            temp = in[i] * 4;
        else
            temp = in[i] * 2;
        out[i] = temp;
    }

Scalar code:

    loop:
        lw r0, in
        slt r1, r0, 4
        bnez r1, b3
    b2:
        sll r2, r0, 1
        j b4
    b3:
        sll r2, r0, 2
    b4:
        sw r2, out
        # bookkeeping code...
        bnez len, loop

Additional compiler tasks beyond the simple DOALL case:
- Map branches and fall-through paths to VP fetches
  - Place AIB addresses in shared registers as an optimization
- Compute induction variable values used in internal loop blocks (not required for this example)

Internal control flow can be handled by allowing VPs to fetch their own code (continued): the loop above maps to the VTU as

        vcfgvl r3, 128, ...
        vwrsh s0, b2
        vwrsh s1, b3
    loop:
        setvl r4, len
        vlw v0, in
        vslt p, v0, 4
        psel.fetch s1, s0
    b2:
        vsll sd0, v0, 1
    b3:
        vsll sd0, v0, 2
    b4:
        vsw sd0, out
        # bookkeeping code...
        bnez len, loop

Although the example is simple, it illustrates how the compiler is able to map complex control flow to VPs
- There is no need to execute both sides of a branch and throw away one set of results
- However, it is possible to perform if-conversion, although that is not currently implemented (a sketch of what it would produce follows)
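
For reference, this is roughly the shape if-conversion would give the loop: both arms execute and a predicate selects the result, removing the need for per-VP fetches. This is a sketch of the alternative the slide mentions, not what the compiler currently emits; ifconv is a hypothetical name.

    /* If-converted form: no branches in the body, so it vectorizes
       without thread-fetches, at the cost of executing both arms. */
    void ifconv(int len, int *out, const int *in)
    {
        for (int i = 0; i < len; i++) {
            int p  = in[i] < 4;       /* predicate (vslt in the VTU code) */
            int t4 = in[i] << 2;      /* then-arm: in[i] * 4 */
            int t2 = in[i] << 1;      /* else-arm: in[i] * 2 */
            out[i] = p ? t4 : t2;     /* predicated select */
        }
    }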

The ability of VPs to direct their control flow allows outer loop parallelization

    for (i = 0; i < len; i++) {
        sum = 0;
        for (j = 0; j < len-i; j++)
            sum += in[j] * in[j+i];
        out[i] = sum;
    }

Scalar code:

    loop1:
        li r0, 0
        sub r1, len, i
        move r2, in
        sll r3, i, 2
        add r4, r3, in
    loop2:
        lw r5, r2
        lw r6, r4
        mult r7, r5, r6
        add sum, r7
        # bookkeeping code...
        bnez r1, loop2
        sw sum, out
        # bookkeeping code...
        bnez len, loop1

The compiler has the same tasks as in the previous case:
- The new aspect illustrated by this example is the need to compute induction variable values in internal loop blocks
- There is no need to perform loop interchange or unrolling

The ability of VPs to direct their control flow allows outer loop parallelization (continued): the outer loop maps to the VTU as follows (elided on the slide)

        vcfgvl r8, 128, ...
        vwrsh s0, len
        vwrsh s1, in
        la r9, vp_numbers
        vlb v0, r9
    loop1:
        setvl r10, len
        vwrsh s2, i
        vadd v1, s2, v0
        ...
    loop2:
        vplw ...
        ...
        bnez len, loop1
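
Read back into C, the mapping looks like the sketch below: each VP derives its own value of i from its VP number (the vp_numbers vector loaded above) and runs the entire inner loop itself in thread-fetched code. VLMAX and autocorr_outer are illustrative names, and the per-VP inner loops actually run concurrently.

    /* Sketch of outer-loop parallelization: each VP owns one outer
       iteration and executes the whole inner loop via thread-fetched AIBs. */
    void autocorr_outer(int len, int *out, const int *in)
    {
        enum { VLMAX = 128 };
        for (int base = 0; base < len; base += VLMAX) {
            int vl = (len - base < VLMAX) ? (len - base) : VLMAX;  /* setvl */
            for (int vp = 0; vp < vl; vp++) {      /* VPs run concurrently */
                int i = base + vp;                 /* vadd v1, s2, v0 */
                int sum = 0;
                for (int j = 0; j < len - i; j++)  /* per-VP inner loop */
                    sum += in[j] * in[j + i];
                out[i] = sum;
            }
        }
    }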

Loop-carried dependences can be mapped to the cross-VP network

    for (i = 1; i < len; i++)
        out[i] = in[i] * out[i-1];

Scalar code:

        sub len, 1
        lw r0, -4(out)
    loop:
        lw r1, in
        mult r0, r1
        sw r0, out
        add in, 4
        add out, 4
        sub len, 1
        bnez len, loop

Additional compiler tasks beyond the simple DOALL case:
- Insert commands to push the initial value into the cross-VP network and to pop the final value
- Map loop-carried values to prevVP/nextVP queues in VP code
- Copy any cross-VP queue values that have more than one reader to registers

Loop-carried dependences can be mapped to the cross-VP network (continued): the scalar loop above becomes

        sub len, 1
        lw r0, -4(out)
        vcfgvl r2, 128, ...
        xvppush r3, x0
    loop:
        setvl r4, len
        vlw v0, in
        vmult v1, v0, prevVP
        vmove sd0, v1
        vmove nextVP, v1
        vsw sd0, out
        # bookkeeping code...
        bnez len, loop
        xvppop r5, x0
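
As a sequential C view of what the cross-VP queue carries, the recurrence can be written as below; carry stands in for the value flowing through the prevVP/nextVP queues, the comments point to the corresponding VTU instructions, and scan_mult is a hypothetical name.

    /* Sequential view of out[i] = in[i] * out[i-1]; on Scale, carry
       travels VP-to-VP through the cross-VP queue instead of a variable. */
    void scan_mult(int len, int *out, const int *in)
    {
        int carry = out[0];          /* seeded with xvppush */
        for (int i = 1; i < len; i++) {
            carry *= in[i];          /* vmult v1, v0, prevVP */
            out[i] = carry;          /* vmove sd0, v1; vsw sd0, out */
        }                            /* each step forwards carry: vmove nextVP, v1 */
        /* the final carry would be retrieved with xvppop */
    }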

The compiler focuses on improving throughput rather than reducing single-thread latency
- Various phases are aimed at minimizing physical register usage
  - Cluster assignment attempts to balance work even at the expense of inter-cluster moves
  - Instruction scheduling tries to pack dependence chains together
  - Chain register insertion is designed to avoid using the register file for short-lived values
- Additional details are in the paper

Evaluation methodology
- The Scale simulator uses detailed models for the VTU and cache, but a single-instruction-per-cycle latency for the control processor
  - This reduces the magnitude of parallelized code speedups
- Performance is evaluated across a limited number of EEMBC benchmarks
  - EEMBC benchmarks are difficult to automatically parallelize
  - Continued improvements to the compiler infrastructure (e.g. if-conversion, front end loop transformations) would enable broader benchmark coverage

The speedups of (relatively) unoptimized code reflect Scale's advantages
- Speedups exceed or are comparable to those observed in a limit study [Islam07] performed for an idealized 16-core multiprocessor supporting thread-level speculation; the same holds against an infinite number of cores
- The results point to the benefits of exploiting parallelism within a single core

[Figure: per-benchmark speedup chart; one bar is annotated "more accurately ~11x".]

There is a variety of related work
- TRIPS also exploits multiple forms of parallelism, but its compiler's focus is on forming blocks of useful instructions and mapping instructions to ALUs
- Stream processing compilers share some similarities with our approach, but also have somewhat different priorities, such as managing the utilization of the Stream Register File
- IBM's Cell compiler has to deal with issues such as alignment and branch hints, which are not present for Scale
- GPGPU designs (Nvidia's CUDA, AMD's Stream Computing) also have similarities with Scale, but the differences in the programming models result in different focuses in the compilers

Concluding remarks
- Vector-thread architectures exploit multiple forms of parallelism
- This work presented a compiler for the Scale vector-thread architecture
- The compiler can parallelize a variety of loop types
- Significant performance gains were achieved over a single-issue scalar processor

A comparison to handwritten code shows there is still significant room for improvement
- There are several optimizations that can be employed to narrow the performance gap