Scalable Vector Coprocessor for Media Processing
Christoforos Kozyrakis
IRAM Project Retreat, July 12th, 2000

This Presentation

A direction for future work on vector coprocessors
– Motivated by work on VIRAM-1
– My approach to scalable vector architectures: Krste's thesis was not the end of it
Looking to motivate heated discussions and get some early feedback
This is a short presentation
– Several details omitted or still unknown
– Qualitative arguments available for now; quantitative data will follow in the future
Familiarity with the VIRAM-1 (or some other vector) architecture is not necessary, but it is useful…

Outline

Key assumptions
The goal: an architecture platform for scalable vector coprocessors
Inefficiencies of the VIRAM architecture
Scalable architecture overview
Discussion of a few important architecture issues
– Register discovery
– Cluster assignment
– Memory latency
– Vector chaining
Other architecture issues

Assumptions

Media processing is important
Vector processing is a good match for media processing
There is no single optimal chip
– Media applications have a wide range of performance, power, and cost requirements
– Have to address scaling and customization issues
Software is the “king”
– HLL/compiler-based software development
– Software compatibility among chips is important
Useful guidelines
– Locality (to avoid interconnect scaling issues)
– Modularity (to decrease design time)
– Simplicity

The Goal

An architecture platform for vector coprocessors that:
– Is efficient for media processing (performance, power, area, complexity)
– Is scalable and customizable: processing power, area, cost, and complexity can be adapted to a specific application domain
– Provides binary compatibility across the various implementations
– Works well with a variety of main processor architectures targeting different types of parallelism
– Works well with a variety of memory systems
[Diagram: parallelism levels (irregular ILP, data, thread, multi-programming) mapped against candidate architectures (superscalar? VLIW? MT? SMT? CMP? MPP? NOW?), with vector positioned as the efficient solution]

Inefficiencies of the VIRAM Architecture

Scaling by allocating more vector lanes
– Large scaling steps
– Requires long vectors for efficiency and/or puts pressure on instruction issue bandwidth
– Fixed number of functional units, non-optimal datapath use
Scaling by adding functional units to the lanes
– Lane must be redesigned
– Register file complexity (2-3R/1W ports per FU)
– The area, delay, and power of a register file for N functional units grow as N^3, N^(3/2), and N^3 respectively (see the sketch below)
Dependence on memory system details
– Control and lanes are designed around the specific memory system
Not well suited for a multi-issue or multi-threaded scalar core
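A quick back-of-the-envelope sketch of the register file scaling rule quoted above; the growth exponents are from the slide, while the code itself is purely illustrative:

```python
# Relative register-file cost versus a single-FU baseline, using the
# growth rates quoted on the slide: area ~ N^3, delay ~ N^(3/2),
# power ~ N^3 for N functional units.
def regfile_cost(n_fus):
    return {
        "area":  n_fus ** 3,
        "delay": n_fus ** 1.5,
        "power": n_fus ** 3,
    }

for n in (1, 2, 4, 8):
    c = regfile_cost(n)
    print(f"{n} FUs: area x{c['area']:.0f}, delay x{c['delay']:.1f}, "
          f"power x{c['power']:.0f}")
```

Even scaling from 2 to 4 FUs multiplies register file area and power by 8x, which motivates the clustered organization that follows.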

Vector CoprocessorsIRAM Retreat, Summer 2000 Scalable Vector Architecture

The Microarchitecture

Execution clusters (N)
– A small, simple vector processor without a memory system that implements some subset of the ISA
– 1 or 2 functional units (64b datapaths?)
– An instruction queue
– A few local vector registers for temporary results (4 to 8?)
The architecture state cluster (1)
– Global vector register file (32+ registers)
The memory cluster (1)
– Interface to the memory system; memory system details are exposed here
– A few local vector registers for decoupling or software speculation support (4 to 8?)
A configuration sketch of this organization follows below.
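To make the organization concrete, a minimal configuration sketch; the class and field names and the queue depth are assumptions for illustration, while the register counts come from the slide:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExecCluster:           # one of N execution clusters
    func_units: int = 2      # 1 or 2 FUs (64b datapaths?)
    local_vregs: int = 8     # temporaries (4 to 8?)
    iq_depth: int = 4        # instruction queue depth (assumed)

@dataclass
class VectorCoprocessor:
    clusters: List[ExecCluster] = field(
        default_factory=lambda: [ExecCluster(), ExecCluster()])
    global_vregs: int = 32    # architecture state cluster: 32+ registers
    mem_local_vregs: int = 8  # memory cluster: decoupling/speculation (4 to 8?)

cfg = VectorCoprocessor()
print(len(cfg.clusters), cfg.global_vregs, cfg.mem_local_vregs)
```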

The Microarchitecture (cont.)

Vector issue logic (1)
– Issues instructions to clusters
– Does not handle chaining or scheduling
– The number/mix of clusters and the details of the main processor are exposed here
Cluster data interconnect (1)
– Moves data between the various clusters
– Anything from a simple bus to a full crossbar
Control bus (1)
– For issuing instructions and transfers to the clusters

Why Clusters?

Benefits
– Performance/area/power = f(# of clusters, mix of clusters, type and BW of cluster interconnect)
– Reduced complexity within each cluster (small register file, simple datapaths, simple control, all interconnect local)
– Reduced complexity for the global register file (few ports)
– Modularity through cluster design reuse
– Instruction classes with different characteristics can be separated
– No need for a single synchronous clock across clusters
Potential disadvantage: inter-cluster communication
– Cycles used moving data between clusters
– Cost of the required cluster data interconnect

Inter-cluster Communication

Should be infrequent:
– Streaming nature of multimedia applications: most temporary results are used once
– Clusters can be assigned independent instructions: instructions from different iterations of the outer loop, from different loops, or from different threads
– Clusters of different types rarely communicate (e.g., integer and floating-point clusters)
Critical issues to work on
– Assignment of instructions to clusters
– Code scheduling for such an architecture

Issue 1: Register Discovery

Within a cluster
– Source registers may be local or may arrive from the interconnect
– The result is written to a local register
At issue time in the VIL
– Keep track of each architectural register's true location with register renaming hardware (sketched below)
– If a source register is not local to the cluster executing the instruction, initiate an inter-cluster transfer
– If there is no local register available for the result in the cluster, initiate a transfer from a local to a global register to free up space
– Note: single issue is enough if each vector instruction occupies a functional unit for multiple cycles, i.e., if cluster datapaths are kept narrow
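A minimal sketch of this issue-time bookkeeping as a software model; all names (Instr, free_local, cluster IDs) and the spill policy are illustrative assumptions, not the proposed hardware:

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")

rename = {}                       # arch vreg -> (cluster_id, local_slot)
free_local = {0: [0, 1, 2, 3],    # free local-register slots per cluster
              1: [0, 1, 2, 3]}

def issue(instr, cid):
    """Issue instr to cluster cid, initiating transfers as needed."""
    for src in instr.srcs:
        loc = rename.get(src)     # None means it lives in the global file
        if loc is None or loc[0] != cid:
            print(f"transfer v{src} -> cluster {cid}")   # inter-cluster move
    if not free_local[cid]:       # no local slot for the result:
        victim = next(r for r, (c, _) in rename.items() if c == cid)
        _, slot = rename.pop(victim)                     # move a local value
        free_local[cid].append(slot)                     # out to free up space
        print(f"move v{victim} from cluster {cid} to global file")
    slot = free_local[cid].pop()
    rename[instr.dst] = (cid, slot)                      # result written locally

issue(Instr("vadd", dst=3, srcs=(1, 2)), cid=0)
```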

Issue 2: Cluster Assignment

Simple if only one cluster can handle the instruction
If multiple clusters are available, decide based on:
– Location of the source operands
– Availability of a local register for the result
– How busy the candidate clusters are
– Software hints (e.g., thread of execution)
Need experimental work to determine which policies work best (one candidate policy is sketched below)
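One candidate policy, purely illustrative: score each candidate cluster on the criteria above with assumed weights, reusing the Instr/rename/free_local structures from the previous sketch:

```python
def pick_cluster(instr, candidates, rename, free_local, busy):
    """Pick the cluster with the best locality/availability/load score.
    The weights (2, 1, 1) are arbitrary placeholders to be tuned
    experimentally, as the slide notes."""
    def score(cid):
        local_srcs = sum(1 for s in instr.srcs
                         if rename.get(s, (None,))[0] == cid)
        has_free = 1 if free_local[cid] else 0
        return 2 * local_srcs + has_free - busy[cid]
    return max(candidates, key=score)

# Usage: v1 lives in cluster 1 and cluster 0 is busier -> pick cluster 1.
print(pick_cluster(Instr("vmul", dst=4, srcs=(1, 2)),
                   candidates=(0, 1),
                   rename={1: (1, 0)},
                   free_local={0: [2], 1: [1]},
                   busy={0: 3, 1: 1}))
```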

Issue 3: Memory Latency

Use the local registers in the memory cluster for decoupling
Each load is decomposed into:
– A load into a local register in the memory cluster
– A (later) move from the memory cluster to a global/local register
Each store is decomposed into:
– A move from some cluster to a register in the memory cluster
– A store from the local register to the memory system
“Store & deallocate” should be a useful instruction (the decomposition is sketched below)
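A sketch of this decomposition; the mnemonics (vld.local, vmove, vst.dealloc) are invented for illustration, with vst.dealloc standing in for the “store & deallocate” idea from the slide:

```python
_free_mem_regs = list(range(8))   # memory-cluster local registers (4 to 8?)

def expand_load(vd, addr):
    m = _free_mem_regs.pop()                  # allocate a memory-cluster reg
    return [("vld.local", m, addr),           # memory system -> local reg
            ("vmove", vd, m)]                 # (later) local reg -> vd

def expand_store(vs, addr):
    m = _free_mem_regs.pop()
    return [("vmove", m, vs),                 # producing cluster -> local reg
            ("vst.dealloc", addr, m)]         # store, then free the local reg

print(expand_load("v5", 0x1000))
print(expand_store("v6", 0x2000))
```

The split lets the memory cluster run ahead on loads while the consuming cluster picks up the data only when it needs it.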

Issue 4: Vector Chaining

If all sources are local within a cluster
– Just like in a non-clustered vector architecture
If some sources are non-local
– The chaining rate is dictated by the arrival of the non-local data
– If the data for the next element operation have arrived, execute the corresponding operation (simple control); see the sketch below
Due to the simplicity of each cluster and its independence from memory latency, density-time implementations of conditional operations are easy to combine with full chaining
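A toy model of element-granularity chaining on a non-local source, assuming elements stream in order over the cluster data interconnect; the iterator interface stands in for the arrival logic:

```python
def chained_op(local_src, remote_stream, op):
    """Execute element i as soon as its remote operand arrives; the
    execution rate is set by data arrival, as described above."""
    return [op(a, b) for a, b in zip(local_src, remote_stream)]

# Usage: a vadd whose second source streams in from another cluster.
remote = iter([10, 20, 30, 40])   # stands in for interconnect arrivals
print(chained_op([1, 2, 3, 4], remote, lambda a, b: a + b))  # [11, 22, 33, 44]
```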

Other Issues (1)

Optimal configurations for various application domains
Cluster organization
– Number, type, and mix of clusters (integer/FP/mixed)
– Number and width of functional units
– Number of local registers per cluster
– Instruction queue size, need for queues at CDI inputs/outputs, etc.
Memory cluster
– Number of local registers
– Organization (number of address generators, number of pending accesses, etc.)
Architecture state cluster
– Number of register file ports

Other Issues (2)

Cluster data interconnect
– Type (bus, other), bandwidth, protocol (packet-based?)
– Synchronization of plesiochronous clusters
Code scheduling for a clustered vector architecture
– Effect on inter-cluster communication frequency
– Handling run-time or loop constants (replication in hardware or software)
Support for speculation in software
Coprocessor interface enhancements
Memory system optimizations for vector coprocessors
– Several options available here as well
– Too large an issue to cover in this presentation
Pick a good name (preferably from Greek mythology)

Vector CoprocessorsIRAM Retreat, Summer 2000 Backup slides

VIRAM Prototype Architecture

[Block diagram: MIPS64™ 5Kc scalar core with instruction and data caches, FPU, coprocessor interface, and SysAD interface; vector unit with an 8KB vector register file, a 512B flag register file, two arithmetic units, two flag units, and a memory unit with TLB and DMA; a 32B memory crossbar with 8B ports connecting eight 2MB DRAM banks; JTAG interface]

Delayed Vector Pipeline

Random access latency is included in the vector unit pipeline
Arithmetic operations and stores are delayed to shorten RAW hazards
Long hazards are eliminated for the common loop cases
Vector pipeline length: 15 stages
DRAM latency: >25ns
[Pipeline diagram: scalar stages F D R E M W followed by vector stages (A, T, VR, VX, VW); vld, vadd, and vst timelines showing the load→add RAW hazard covered by a DELAY before the add's execute stage in a vld/vadd/vst loop]
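A toy timing model of the idea, with an assumed latency standing in for the real 15-stage pipeline: delaying the dependent arithmetic operation's read stage by the memory access latency removes the load→add RAW stall in the steady-state loop.

```python
MEM_LAT = 6   # cycles of DRAM access folded into the pipeline (assumed)

def raw_stall(delay_stages):
    """Stall between a vld and a dependent vadd issued one cycle later."""
    data_ready = MEM_LAT                 # cycle the load data returns
    operand_read = 1 + delay_stages      # cycle the delayed add reads it
    return max(0, data_ready - operand_read)

print(raw_stall(0))         # undelayed pipeline: stalls on the RAW hazard
print(raw_stall(MEM_LAT))   # delayed pipeline: hazard hidden, no stall
```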

Modular Vector Unit Design

Single 64b “lane” design replicated 4 times
– Reduces design and testing time
– Provides a simple scaling model (up or down) without major control or datapath redesign
Most instructions require only intra-lane interconnect
– Tolerance to interconnect delay scaling
[Diagram: four identical lanes, each with an 8B crossbar interface, two integer datapaths, an FP datapath, vector register elements, and flag register elements and datapaths, sharing a 32B control path]

Vector CoprocessorsIRAM Retreat, Summer 2000 VIRAM-1 Floorplan

Short Vectors

Very common in media applications
– Block-based algorithms (e.g., MPEG), short filters, etc.
Outer-loop vectorization
– Not always available (loop-carried dependencies, irregular outer loops, short outer loops)
– Requires more sophisticated compiler technology
– May turn sequential accesses into strided/indexed ones (illustrated below)
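To illustrate the last point, a small NumPy sketch (the 8×8 block and the kernel are hypothetical): vectorizing the outer loop of a block algorithm makes each vector gather one element per row, turning unit-stride accesses into stride-8 accesses.

```python
import numpy as np

block = np.arange(64, dtype=np.int32).reshape(8, 8)  # e.g., an 8x8 MPEG block

# Inner-loop vectorization: each vector is one row -> unit-stride loads,
# but the vectors are only 8 elements long.
rows = [block[i, :] for i in range(8)]

# Outer-loop vectorization: element j of every row forms one vector ->
# stride-8 loads in the flat row-major layout.
cols = [block[:, j] for j in range(8)]

print(rows[0], cols[0])
```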