PipeRench: A Coprocessor for Streaming Multimedia Acceleration Seth Goldstein, Herman Schmit et al. Carnegie Mellon University.

Presentation transcript:


Introduction
Multimedia workloads increasingly emphasize relatively simple calculations on massive quantities of mixed-width data. This underutilizes the processing strengths of conventional processors in two important respects:
1. The size of the data elements underutilizes the processor's wide datapath; multimedia extensions and SIMD instructions attempt to address this deficiency (see SIMD later)
2. Instruction bandwidth is much higher than needed for regular, dataflow-dominated computations on large datasets; hence the renewed interest in vector processing (see IRAM later)

Reconfigurable Computing
A fundamentally different approach is to configure connections between programmable logic elements and registers to construct an efficient, highly parallel implementation of the processing kernel.
The interconnected network of processing elements is called a reconfigurable fabric.
The dataset used to program the interconnect and processing elements is a configuration.
The advantages of this approach, known as reconfigurable computing, are that no further instruction download is required after a configuration is loaded, and that the right combination of simple processing elements can be assembled to match the requirements of the computation.

Reconfigurable Computing Challenges
–Picking the right logic granularity
–Living with hard constraints
–Minimising configuration overheads
–Finding appropriate mappings
–Excessive compilation times
–Providing forward compatibility

PipeRench solves some of these problems
PipeRench uses a technique called pipeline reconfiguration to solve the problems of compilability, reconfiguration time, and forward compatibility.
Architectural parameters, including logic block granularity, have been chosen to optimize performance for a suite of multimedia kernels.
PipeRench is claimed to balance the needs of the compiler against the design realities of deep sub-micron process technology.

Attributes of Target Kernels
Reconfigurable fabrics can provide significant benefits for functions with one or more of the following features:
1. The function operates on bit-widths that differ from the processor's native word size
2. Data dependencies in the function allow multiple function units to operate in parallel
3. The function is composed of a series of basic operations that can be combined into a single specialized operation
4. The function can be pipelined
5. Constant propagation can be performed, reducing the complexity of operations
6. The input values are reused many times within the computation

Two broad categories emerge…
Stream-based functions process a large data input stream and produce a large output stream.
Custom instructions take a few inputs and produce a few outputs.

Stream-based example Code for a FIR filter and a pipelined version for a three-tap filter
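The filter code itself is not reproduced in this transcript. As a rough sketch (the function name, tap values, and indexing convention are assumptions, not the paper's code), a three-tap FIR filter computes a weighted sum over a sliding window of the input stream, and each window position is independent, which is what makes the computation pipelinable:

```python
def fir3(x, c):
    """Direct-form FIR filter: slide the taps c across the stream x.

    y[i] = c[0]*x[i] + c[1]*x[i+1] + ... + c[n-1]*x[i+n-1]
    Each output depends only on a small window of inputs, so successive
    windows can be computed in independent pipeline stages.
    """
    n = len(c)
    return [sum(c[j] * x[i + j] for j in range(n))
            for i in range(len(x) - n + 1)]
```

For a unit impulse the output simply replays the taps, e.g. `fir3([0, 0, 1, 0, 0], [3, 2, 1])` gives `[1, 2, 3]`.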

Performance on 8-bit FIR filters

Custom instruction example The reconfigurable computing solution replaces the O(n) loop with an adder tree of height O(log n)
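To make the O(n) versus O(log n) contrast concrete, here is a hedged software sketch (function names and the 32-bit masks are illustrative, not from the paper) of a sequential bit count next to the tree-structured reduction that a custom instruction would implement as an adder tree in hardware:

```python
def popcount_loop(x, n=32):
    """Sequential version: one add per bit position, O(n) steps."""
    count = 0
    for _ in range(n):
        count += x & 1
        x >>= 1
    return count

def popcount_tree(x):
    """Tree version: pairwise partial sums, O(log n) levels.

    Each line sums adjacent fields of twice the width, mirroring
    one level of a hardware adder tree (32-bit word assumed).
    """
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)
    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF)
    return x
```

In software both run on one ALU, but in a fabric the five mask-and-add levels become five layers of narrow adders operating in parallel, so the latency is the tree height rather than the bit count.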

Configuration time & Communication latency
If the previous popCount function is called just once, it may not be worth configuring the fabric, because the time needed to configure the function exceeds the benefit obtained from executing it on the fabric.
If the function is used outside of a loop and its results are to be used immediately, the fabric needs direct access to the processor's registers.
On the other hand, if the function is used in a loop with no immediate dependencies on its results, performance can be improved by giving the fabric direct access to memory.
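The break-even condition can be written down directly; the cycle counts in the usage example below are purely hypothetical placeholders, not measurements from the paper:

```python
def fabric_wins(calls, cpu_cycles, fabric_cycles, config_cycles):
    """True if offloading `calls` invocations to the fabric is faster.

    The one-time configuration cost must be amortized over all calls:
    calls * fabric_cycles + config_cycles < calls * cpu_cycles
    """
    return calls * fabric_cycles + config_cycles < calls * cpu_cycles
```

With, say, a 1000-cycle configuration and a 10x per-call speedup (100 vs 10 cycles), a single call loses (`fabric_wins(1, 100, 10, 1000)` is False) while 100 calls in a loop win comfortably (`fabric_wins(100, 100, 10, 1000)` is True).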

Possible placements for reconfigurable fabrics

Pipelined Reconfigurable Architectures
We have seen how application-specific configurations can be used to accelerate applications. The static nature of these configurations causes problems if:
1. The computation requires more hardware than is available, or
2. The configuration doesn't exploit the additional resources that will inevitably become available in future process generations
Pipeline reconfiguration allows a large logical design to be implemented on a small piece of hardware through rapid reconfiguration of that hardware.

Pipeline Reconfiguration This diagram illustrates the process of virtualizing a five-stage pipeline on a three-stage device
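Since the diagram is not reproduced here, a toy scheduling model can stand in for it (names and details are assumed; it tracks only which virtual stage occupies which physical stripe each cycle, not the datapath or stage state):

```python
def stage_schedule(V, H, cycles):
    """Round-robin virtualization of V virtual stages on H physical stripes.

    One stripe is (re)configured per cycle with the next virtual stage
    in sequence, so the full V-stage configuration is never resident
    at once. Returns the stripe contents after each cycle.
    """
    timeline = []
    stripes = [None] * H  # physical stripes, initially unconfigured
    for t in range(cycles):
        stripes[t % H] = t % V  # configure one stripe this cycle
        timeline.append(list(stripes))
    return timeline
```

With V=5 and H=3 (the five-stage design on a three-stage device), after six cycles the stripes hold virtual stages 3, 4, and 0: the design wraps around the device, and computation proceeds as each stage streams its results to the next before being overwritten.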

Benefits of pipeline reconfiguration
Pipeline reconfiguration breaks the single static configuration into pieces that correspond to pipeline stages; these are then loaded, one per cycle, into the fabric. Computation proceeds even though the whole configuration is never present at one time.
With this technique, the compiler is no longer responsible for satisfying fixed hardware constraints.
In addition, the performance of the design improves in proportion to the amount of hardware allocated to it: as future process technology makes more transistors available, the same hardware designs achieve higher levels of performance.
The configuration cost is hidden.

Challenges of pipeline reconfiguration
For virtualization to work, cyclic dependencies must fit within one stage of the pipeline:
–Interconnections to previous stages, or to future stages other than the immediate successor, are not allowed
–Fortunately, this is not a severe restriction for multimedia computations, and the architecture provides pass registers to support forwarding
The primary challenge is configuring a computationally significant pipeline stage in one cycle:
–Wide on-chip configuration buffers must be used
Before swapping virtual stages, the state of the resident stage, if any, must be saved; this state is restored when that stage is loaded once more.

The PipeRench architectural class
–ALUs are implemented as LUTs
–PEs have access to the global I/O bus
–PEs can access operands from the registered outputs of the previous stage as well as the current stage
–There is no interconnect back to the previous stage

Pass Register File
Provides efficient (registered) interstage connections. The ALU output can write to any of the P pass registers; otherwise a register is loaded from the corresponding register in the previous stage.

Interconnection network
A full B-bit N×N crossbar, plus a barrel shifter for word-based arithmetic
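A barrel shifter rotates a word by any amount in a single step rather than one bit at a time. A minimal software sketch of the operation (B=8 bits assumed purely for illustration):

```python
def barrel_rotate_left(x, s, B=8):
    """Rotate the B-bit word x left by s positions in one operation.

    Bits shifted off the top wrap around to the bottom, which is what
    distinguishes a barrel rotate from a plain shift.
    """
    mask = (1 << B) - 1
    s %= B
    return ((x << s) | (x >> (B - s))) & mask
```

For example, rotating `0b10000001` left by 1 wraps the top bit around to give `0b00000011`.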

Evaluation
Three architectural parameters:
–N (number of PEs per stage)
–B (bit-width of ALU and registers)
–P (number of pass registers)
Performance is evaluated as these parameters are varied, using several kernels:
–ATR
–Cordic
–DCT
–FIR
–IDEA
–Nqueens
–Over (the Porter-Duff operator for joining two images based on a mask of transparency values for each pixel)
–popCount

Representative results on up to 8 registers

Over all kernels

Speedup
Speedup for 8-bit PEs, 8 registers/PE, 128-bit wide stripes (stages)