The Vector-Thread Architecture
Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
ISCA 2004

Goals For Vector-Thread Architecture
- Primary goal is efficiency
  - High performance with low energy and small area
- Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP
  - Allow intermixing of multiple levels of parallelism
- Programming model is key
  - Encode parallelism and locality in a way that enables a complexity-effective implementation
  - Provide clean abstractions to simplify coding and compilation

Vector and Multithreaded Architectures
- Vector processors provide efficient DLP execution
  - Amortize instruction control
  - Amortize loop bookkeeping overhead
  - Exploit structured memory accesses
  - Unable to execute loops with loop-carried dependencies or complex internal control flow
- Multithreaded processors can flexibly exploit TLP
  - Unable to amortize common control overhead across threads
  - Unable to exploit structured memory accesses across threads
  - Costly memory-based synchronization and communication between threads
[Figure: a control processor issuing vector control to PE0..PEN over shared memory, versus independent thread control of PE0..PEN]

Vector-Thread Architecture
- VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
  - Vector-fetch: the control processor fetches instructions for all VPs in parallel
  - Thread-fetch: a VP fetches its own instructions
- VT allows a seamless intermixing of vector and thread control
[Figure: control processor issuing vector-fetches to VP0..VPN over shared memory, with individual VPs issuing their own thread-fetches]

Outline
- Vector-Thread Architectural Paradigm
  - Abstract model
  - Physical model
- SCALE VT Processor
- Evaluation
- Related Work

Virtual Processor Abstraction
- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
  - VPs contain pending vector-fetch and thread-fetch addresses
- A fetch instruction allows a VP to fetch its own AIB
  - May be predicated for conditional branches
- If an AIB does not execute a fetch, the VP thread stops
[Figure: a VP (registers + ALUs) receiving a vector-fetch and then chaining AIBs through its own thread-fetches]
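
Below is a minimal behavioral sketch in C of the VP abstraction described above. The types, the 32-entry register file, and the function-pointer encoding of an AIB are illustrative assumptions, not the SCALE ISA; the point is that a VP has no program counter and only advances while some fetch has named a pending AIB.

#include <stddef.h>

struct aib;

typedef struct vp {
    int regs[32];              /* hypothetical private register file          */
    const struct aib *pending; /* AIB named by a vector-fetch or thread-fetch */
} vp_t;

struct aib {
    /* Runs the block on vp; if the block executes a (possibly predicated)
     * fetch instruction it returns the next AIB, otherwise NULL. */
    const struct aib *(*run)(vp_t *vp);
};

/* Keep executing thread-fetched AIBs until the VP thread stops. */
void vp_resume(vp_t *vp)
{
    while (vp->pending != NULL)
        vp->pending = vp->pending->run(vp);
}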

Virtual Processor Vector
- A VT architecture includes a control processor and a virtual processor vector
  - Two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
[Figure: control processor and vector memory unit serving VP0..VPN (registers + ALUs) via vector-fetch, vector-load, and vector-store]

Cross-VP Data Transfers
- Cross-VP connections provide fine-grain data operand communication and synchronization
  - VP instructions may target nextVP as a destination or use prevVP as a source
  - The crossVP queue holds wrap-around data; the control processor can push and pop it
  - The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
[Figure: VP0..VPN linked in a ring through the crossVP queue, with control-processor crossVP-push and crossVP-pop]
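
A small C sketch of the cross-VP ring semantics for one vector-fetched AIB (an assumed model, not SCALE code): the control processor seeds the crossVP queue with a push, each VP reads prevVP and writes nextVP, and the control processor pops the wrap-around value at the end. The per-VP work here is an arbitrary running sum.

#include <stdio.h>

#define NVP 4

int main(void)
{
    int x[NVP] = {3, 1, 4, 1};
    int crossvp = 10;                  /* control processor: crossVP push     */

    for (int vp = 0; vp < NVP; vp++) { /* chain created by one vector-fetch   */
        int prev = crossvp;            /* VP reads prevVP                     */
        crossvp  = prev + x[vp];       /* VP writes nextVP                    */
    }

    printf("wrap-around value = %d\n", crossvp); /* control processor: pop    */
    return 0;
}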

Mapping Loops to VT
- A broad class of loops map naturally to VT
  - Vectorizable loops
  - Loops with loop-carried dependencies
  - Loops with internal control flow
- Each VP executes one loop iteration
  - The control processor manages the execution
  - Stripmining enables implementation-dependent vector lengths
- The programmer or compiler only schedules one loop iteration on one VP
  - No cross-iteration scheduling
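
The following C sketch shows the stripmining structure described above, using a saxpy-like loop and a made-up HW_VLEN constant: the outer loop is the control processor walking the data in vector-length chunks, and the inner loop stands in for the vector-fetched AIB that gives each VP exactly one iteration.

#define HW_VLEN 16                      /* implementation-dependent vector length */

void saxpy_stripmined(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += HW_VLEN) {                /* control processor loop  */
        int vlen = (n - i < HW_VLEN) ? (n - i) : HW_VLEN; /* set this strip's length */
        for (int vp = 0; vp < vlen; vp++)                 /* one iteration per VP,   */
            y[i + vp] = a * x[i + vp] + y[i + vp];        /* issued as a vector-     */
    }                                                     /* fetched AIB             */
}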

Vectorizable Loops
- Data-parallel loops with no internal control flow are mapped using vector commands
  - Predication for small conditionals
[Figure: each loop-iteration DAG (ld, ld, <<, x, +, st) maps to one VP; the control processor issues the vector-loads, a vector-fetch, and a vector-store across VP0..VPN]
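
For concreteness, here is a representative C loop of the shape in the figure (two loads, a shift, a multiply, an add, one store); the kernel itself is illustrative rather than taken from the paper. The loop body becomes one vector-fetched AIB, and the array accesses become vector-load and vector-store commands.

void kernel(int n, const int *a, const int *b, int *c, int k)
{
    for (int i = 0; i < n; i++)          /* each iteration runs on its own VP */
        c[i] = (a[i] << 2) * k + b[i];
}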

Loop-Carried Dependencies
- Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers
  - A vector-fetch introduces a chain of prevVP receives and nextVP sends
  - Vector-memory commands are used alongside the non-vectorizable compute
[Figure: the same per-iteration DAGs on VP0..VPN, with the recurrence value passed from each VP to the next via prevVP/nextVP]
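
An illustrative C loop with a cross-iteration dependency (a simple first-order recurrence, not a kernel from the paper): the loads of x[] and stores of y[] can still use vector-memory commands, while the running value of acc is what would travel around the prevVP-to-nextVP ring.

void recurrence(int n, const int *x, int *y, int acc)
{
    for (int i = 0; i < n; i++) {
        acc  = (acc >> 1) + x[i];   /* depends on the previous iteration */
        y[i] = acc;
    }
}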

Loops with Internal Control Flow
- Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
  - Vector commands and thread-fetches are freely intermixed
  - Once launched, the VP threads execute to completion before the next control processor command
[Figure: vector-loads and a vector-fetch launch VP0..VPN; each VP then thread-fetches a data-dependent number of branch/load AIBs before the vector-store]
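
An illustrative C loop whose body contains a data-dependent inner loop (not one of the evaluated benchmarks): the outer iterations are launched with a vector-fetch, and each VP then thread-fetches its own AIBs to walk the inner loop for as many steps as its data requires.

#include <stddef.h>

struct node { int key; int value; const struct node *next; };

void lookup_all(int n, const int keys[], const struct node *table[], int out[])
{
    for (int i = 0; i < n; i++) {                 /* one outer iteration per VP     */
        const struct node *p = table[i];
        while (p != NULL && p->key != keys[i])    /* VP-local, data-dependent inner */
            p = p->next;                          /* loop driven by thread-fetches  */
        out[i] = (p != NULL) ? p->value : -1;
    }
}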

VT Physical Model
- A Vector-Thread Unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data operand locality within VPs

Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- The execution cluster executes instructions in order from a small AIB cache (e.g. 32 instructions)
  - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
  - For a thread-fetch command, the lane executes the AIB for the requesting VP
  - For a vector-fetch command, the lane executes the AIB for every VP
- AIBs and vector-fetch commands reduce control overhead
  - 10s-100s of instructions executed per fetch address tag-check, even for non-vectorizable loops
[Figure: Lane 0 hosting VP0/VP4/VP8/VP12, with an execute-directive queue fed by vector-fetch and thread-fetch commands, AIB tags and cache, an AIB fill path for misses, and the ALU]
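
A behavioral C sketch of the execute-directive handling described above; the structure and helper functions are assumptions for illustration, not the SCALE microarchitecture.

#include <stdbool.h>

/* Hooks assumed to exist in a surrounding simulator. */
bool aib_cache_hit(unsigned aib_addr);
void aib_cache_fill(unsigned aib_addr);
void run_aib(unsigned aib_addr, int vp);

enum ed_kind { ED_THREAD_FETCH, ED_VECTOR_FETCH };

struct exec_directive {
    enum ed_kind kind;      /* which command produced this directive       */
    unsigned     aib_addr;  /* AIB to execute                              */
    int          vp;        /* requesting VP (thread-fetch case only)      */
};

void lane_step(const struct exec_directive *ed, int vps_on_lane)
{
    if (!aib_cache_hit(ed->aib_addr))    /* one address tag-check per AIB  */
        aib_cache_fill(ed->aib_addr);    /* request the block from the fill unit */

    if (ed->kind == ED_THREAD_FETCH) {
        run_aib(ed->aib_addr, ed->vp);   /* run for the requesting VP only */
    } else {
        for (int vp = 0; vp < vps_on_lane; vp++)
            run_aib(ed->aib_addr, vp);   /* vector-fetch: amortized over every VP on the lane */
    }
}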

VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
[Figure: timeline of Lanes 0-3 time-multiplexing their VPs (VP0/VP4/VP8/VP12 on Lane 0, VP1/VP5/VP9/VP13 on Lane 1, ...) as vector-fetched AIBs execute]

VP Execution Interleaving
- Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
  - No static cross-iteration scheduling
  - Tolerant to variable dynamic latencies
[Figure: the same lane timeline, with cross-VP transfers rippling through the vector-fetched AIBs across lanes]

SCALE Vector-Thread Processor
- SCALE is designed to be a complexity-effective all-purpose embedded processor
  - Exploit all available forms of parallelism and locality to achieve high performance and low energy
- Constrained to a small area (estimated 10 mm² in 0.18 μm)
  - Reduce wire delay and complexity
  - Support tiling of multiple SCALE processors for increased throughput
- Careful balance between software and hardware for code mapping and scheduling
  - Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model

SCALE Clusters
- VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
  - Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has integer mult/div
  - Clusters execute decoupled from each other
[Figure: four lanes, each with clusters c0-c3, plus the control processor, L1 cache, and AIB fill unit; a SCALE VP spans one c0-c3 column]

SCALE Registers and VP Configuration
- The number of VP registers in each cluster is configurable
  - The hardware can support more VPs when they each have fewer private registers
  - Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
- Atomic instruction blocks allow VPs to share temporary state, valid only within the AIB
  - VP general registers are divided into private and shared
  - Chain registers (cr0, cr1) at the ALU inputs avoid reading and writing the general register file, saving energy
[Figure: example c0 register file partitionings per lane: 4 VPs with 0 shared regs and 8 private regs; 7 VPs with 4 shared regs and 4 private regs; 25 VPs with 7 shared regs and 1 private reg]
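
A worked example of the register/VP trade-off, assuming the prototype's 32-entry register file per cluster per lane: shared registers are allocated once and the remainder is divided among the VPs' private registers, which reproduces the 4/7/25-VP configurations on the slide.

#include <stdio.h>

static int vps_per_lane(int regfile, int shared, int priv)
{
    return (regfile - shared) / priv;   /* whole VPs that fit after shared regs */
}

int main(void)
{
    printf("%d\n", vps_per_lane(32, 0, 8));  /* 4 VPs,  0 shared, 8 private */
    printf("%d\n", vps_per_lane(32, 4, 4));  /* 7 VPs,  4 shared, 4 private */
    printf("%d\n", vps_per_lane(32, 7, 1));  /* 25 VPs, 7 shared, 1 private */
    return 0;
}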

SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks

Software VP code:

  cluster  operation      destinations
  c0       xor pr0, pr1   c1/cr0, c2/cr0
  c1       sll cr0, 2     c2/cr1
  c2       add cr0, cr1   pr0

Hardware micro-ops (one cluster micro-op bundle per column; Cluster 3 not shown):

  Cluster 0                   Cluster 1                     Cluster 2
  wb   compute       tp       wb        compute     tp      wb                compute            tp
  --   xor pr0,pr1   c1,c2    c0->cr0   sll cr0,2   c2      c0->cr0, c1->cr1  add cr0,cr1->pr0   --

SCALE Cluster Decoupling
- Cluster execution is decoupled
  - Cluster AIB caches hold micro-op bundles
  - Each cluster has its own execute-directive queue and local control
  - Inter-cluster data transfers synchronize with handshake signals
- Memory access decoupling (see paper)
  - A load-data queue enables continued execution after a cache miss
  - A decoupled store queue enables loads to slip ahead of stores
[Figure: clusters 0-3, each with an AIB cache, VP registers, ALU, and writeback/compute/transport stages connected by inter-cluster transports]
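
A conceptual C sketch of the load-data-queue idea (the data structure is assumed, not the SCALE design): load slots are allocated in program order, the memory system fills them whenever data returns, and the consuming cluster stalls only when the entry at the head is still outstanding.

#include <stdbool.h>

#define LDQ_SIZE 8

struct ldq_entry { bool ready; int data; };

struct ldq {
    struct ldq_entry e[LDQ_SIZE];
    int head, tail;                 /* consume from head, allocate at tail */
};

int ldq_alloc(struct ldq *q)        /* issue stage reserves a slot in order */
{
    int slot = q->tail;
    q->e[slot].ready = false;
    q->tail = (q->tail + 1) % LDQ_SIZE;
    return slot;
}

void ldq_fill(struct ldq *q, int slot, int data)  /* memory system, any order */
{
    q->e[slot].data  = data;
    q->e[slot].ready = true;
}

bool ldq_pop(struct ldq *q, int *out)    /* consuming cluster, in order */
{
    if (!q->e[q->head].ready)
        return false;                    /* stall only on an outstanding head */
    *out = q->e[q->head].data;
    q->head = (q->head + 1) % LDQ_SIZE;
    return true;
}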

SCALE Prototype and Simulator
- Prototype SCALE processor in development
  - Control processor: MIPS, 1 instruction/cycle
  - VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  - Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
  - DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s: 3.2 GB/s)
  - Estimated 10 mm² in 0.18 μm, 400 MHz (25 FO4)
- Cycle-level execution-driven microarchitectural simulator
  - Detailed VTU and memory system model

Benchmarks
- Diverse set of 22 benchmarks chosen to evaluate a range of applications with different types of parallelism
  - 16 from EEMBC; 6 from MiBench, Mediabench, and SpecInt
- Hand-written VP assembly code linked with C code compiled for the control processor using gcc
  - Reflects the typical usage model for embedded processors
- EEMBC enables comparison to other processors running hand-optimized code, but it is not an ideal comparison
  - Performance depends greatly on programmer effort; algorithmic changes are allowed for some benchmarks and are often unpublished
  - Performance depends greatly on special compute instructions or sub-word SIMD operations (for the current evaluation, SCALE does not provide these)
  - Processors use unequal silicon area, process technologies, and circuit styles
- Overall results: SCALE is competitive with larger and more complex processors on a wide range of codes from different domains
  - See paper for detailed results
  - Results are a snapshot; the SCALE microarchitecture and benchmark mappings continue to improve

Mapping Parallelism to SCALE
- Data-parallel loops with no complex control flow
  - Use vector-fetch and vector-memory commands
  - EEMBC rgbcmy, rgbyiq, and hpg execute 6-10 ops/cycle for a 12x-32x speedup over the control processor; performance scales with the number of lanes
- Loops with loop-carried dependencies
  - Use vector-fetched AIBs with cross-VP data transfers
  - Mediabench adpcm.dec: two loop-carried dependencies propagate in parallel; on average 7 loop iterations execute in parallel, for an 8x speedup
  - MiBench sha has 5 loop-carried dependencies and exploits ILP
[Figure: speedup vs. control processor for rgbcmy, rgbyiq, hpg, adpcm.dec, and sha with 1, 2, 4, and 8 lanes (prototype configuration marked)]

Mapping Parallelism to SCALE
- Data-parallel loops with large conditionals
  - Use vector-fetched AIBs with conditional thread-fetches
  - EEMBC dither: special case for white pixels (18 ops vs. 49)
- Data-parallel loops with inner loops
  - Use vector-fetched AIBs with thread-fetches for the inner loop
  - EEMBC lookup: VPs execute pointer-chasing IP address lookups in a routing table
- Free-running threads
  - No control processor interaction
  - VP worker threads get tasks from a shared queue using atomic memory operations
  - EEMBC pntrch and MiBench qsort achieve significant speedups
[Figure: speedup vs. control processor for dither, lookup, qsort, and pntrch with 1, 2, 4, and 8 lanes]
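
A C sketch of the free-running-threads pattern above, written with generic C11 atomics rather than SCALE's atomic memory operations: each VP worker claims the next task index from a shared counter and stops when the work runs out, with no control-processor interaction.

#include <stdatomic.h>

void do_task(int id);            /* assumed per-task work, defined elsewhere */

atomic_int next_task;            /* shared task-queue index, starts at 0     */

void vp_worker(int num_tasks)    /* body of one free-running VP thread       */
{
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* atomically claim a task */
        if (t >= num_tasks)
            return;                               /* no work left: VP thread stops */
        do_task(t);
    }
}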

Comparison to Related Work
- TRIPS and Smart Memories can also exploit multiple types of parallelism, but must explicitly switch modes
- Raw's tiled processors provide much lower compute density than SCALE's clusters, which factor out instruction and data overheads and use direct communication instead of programmed switches
- Multiscalar passes loop-carried register dependencies around a ring network, but it focuses on speculative execution and memory, whereas VT uses simple logic to support common loop types
- Imagine organizes computation into kernels to improve register file locality, but it exposes machine details with a low-level VLIW ISA, in contrast to VT's VP abstraction and AIBs
- CODE uses register renaming to hide its decoupled clusters from software, whereas SCALE simplifies hardware by exposing clusters and statically partitioning inter-cluster transport and writeback ops
- Jesshope's micro-threading is similar in spirit to VT, but its threads are dynamically scheduled and cross-iteration synchronization uses full/empty bits on global registers

Summary
- The vector-thread architectural paradigm unifies the vector and multithreaded compute models
- The VT abstraction introduces a small set of primitives to allow software to succinctly encode parallelism and locality and seamlessly intermingle DLP, TLP, and ILP
  - Virtual processors, AIBs, vector-fetch and vector-memory commands, thread-fetches, cross-VP data transfers
- The SCALE VT processor efficiently achieves high performance on a wide range of embedded applications