CGRA QUIZ


Quiz
1. What is the fundamental drawback of fine-grained architecture that led to the exploration of coarse grained reconfigurable architectures? (Max of 5 words!)
2. Give two examples for each coarse grained architecture type: Mesh, Linear Array, and Crossbar.
3. Indicate whether each given architecture supports some form of partial reconfiguration or not: PipeRench, KressArray, CHESS.

COARSE GRAINED RECONFIGURABLE ARCHITECTURES
04/21/2014
Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku
DAY - 2

Outline
Coarse Grained Reconfigurable Architectures
– RAW
– CHESS
Basics of Network On Chip (NoC)
Project Overview

Raw Architecture Workstation (RAW)
Developed at MIT.
It fully exposes low-level hardware architectural details to the compiler.
It lacks hardware for register renaming and dynamic instruction issue.
A Raw architecture seeks to execute pipelined applications (like signal processing) efficiently.
Motivation?
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Change Is Around the Corner
Processor performance is not scaling as before.
Wire delay and power:
– Old view: the chip looks small to a wire.
– New view: the chip looks much bigger to a wire; communication is expensive even on chip!
[Figure labels: chip size; distance a signal can travel in 1 cycle]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Raw Architecture
How do we arrive at this design?
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Problems with Monolithic Designs
Super-wide general-purpose processors are no longer practical.
Centralized control with global operand routing.
Area, power, and frequency concerns.
[Figure: monolithic processor with wide fetch (16 inst), unified load/store queue, PC, RF, ALUs, bypass net, and control]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

[Figure: monolithic processor with wide fetch (16 inst), unified load/store queue, PC, RF, ALUs, bypass net, and control, shown with example operations + and >>]

Spatial Architectures
[Figure, three animation steps: ALUs, RF, and bypass net; the bypass net label is absent in the final step]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Exploiting Locality
[Figure: ALUs and RF, with example operations >> and +]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Distribute the Register File
[Figure: ALUs and RF]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Distribute the Rest
[Figure: ALUs, RF, control, wide fetch (16 inst), unified load/store queue, and PC, alongside repeated I$, PC, D$ groups]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Tiled-Processor Architecture
[Figure: ALUs and RF with repeated I$, PC, D$ groups]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Tiled-Processor Architecture
Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer.
Tile abstraction is quite powerful
– e.g., power → resources used as necessary
Easily scalable
All signals registered at tile boundaries, no global signals
Easier to tune the frequency
Easier to do the physical design
Easier to verify
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Raw On-Chip Networks
2 Static Networks
– Provide low-latency communication between tiles.
– Make routing decisions at compile time.
2 Dynamic Networks
– Header encodes the destination.
– Transport unpredictable operations like interrupts and cache misses.
[Figure labels: Computation Resources, Switch Processor]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):
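
The difference between the two network types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header packing, the direction names, and the functions `make_header`, `decode_header`, `xy_path`, and `compile_static_route` are all hypothetical, and the X-then-Y hop order is used simply as an easy deterministic choice. In the static case the per-tile switch directions are computed ahead of time; in the dynamic case each message carries its destination in a header.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a tile in the mesh


def make_header(dest: Coord, width: int) -> int:
    """Pack a destination tile into a header word (hypothetical encoding)."""
    x, y = dest
    return y * width + x


def decode_header(header: int, width: int) -> Coord:
    """Recover the destination tile from the header word."""
    return (header % width, header // width)


def xy_path(src: Coord, dest: Coord) -> List[Coord]:
    """Tiles visited when moving along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dest[0]:
        x += 1 if dest[0] > x else -1
        path.append((x, y))
    while y != dest[1]:
        y += 1 if dest[1] > y else -1
        path.append((x, y))
    return path


def compile_static_route(src: Coord, dest: Coord) -> Dict[Coord, str]:
    """'Compile time': emit one switch direction per tile, no header needed.

    'E'/'W'/'S'/'N' forward the word to a neighbor; 'P' delivers it to the
    local processor. Direction names are illustrative assumptions.
    """
    path = xy_path(src, dest)
    program: Dict[Coord, str] = {}
    for here, nxt in zip(path, path[1:]):
        dx, dy = nxt[0] - here[0], nxt[1] - here[1]
        program[here] = {(1, 0): "E", (-1, 0): "W", (0, 1): "S", (0, -1): "N"}[(dx, dy)]
    program[path[-1]] = "P"
    return program


def dynamic_route(src: Coord, header: int, width: int) -> List[Coord]:
    """'Run time': routers only see the header and decode the destination."""
    return xy_path(src, decode_header(header, width))


static_program = compile_static_route((0, 0), (2, 1))
dynamic_hops = dynamic_route((0, 0), make_header((2, 1), width=4), width=4)
```

Both calls describe the same transfer from tile (0, 0) to tile (2, 1); the static version fixes the switch behavior before execution, while the dynamic version resolves it from the header as the message travels.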

Inside the Compute Processor
[Figure: compute processor pipeline from IF to WB. Labels: input FIFOs from static router (r24, r25, r26, r27); output FIFOs to static router (r24, r25, r26, r27); local bypass network]

Raw Compiler Example
Source code:
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
….
Expanded form:
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10

v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7

seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1

v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9

Assign instructions to the tiles, maximizing locality.
Generate the static router instructions to transfer operands and streams between tiles.
[Figure label: Raw tile]
[Slide Source: Michael B. Taylor]
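
The "assign instructions to the tiles, maximizing locality" step can be illustrated with a small greedy placement in Python. This is a sketch under assumptions, not the Raw compiler's actual algorithm: the heuristic, the two-tile setup, the per-tile capacity, and the names `DEPS`, `place`, and `cross_tile_edges` are hypothetical. The dependence table covers only a subset of the expanded instructions above, and only operands produced inside that subset.

```python
from typing import Dict, List, Tuple

# Producer -> operands produced by earlier instructions (subset of the slide).
DEPS: Dict[str, List[str]] = {
    "pval1": [],
    "pval0": ["pval1"],
    "tmp0.1": ["pval0"],
    "pval2": [],
    "tmp1.3": ["pval2"],
    "pval3": [],
    "tmp2.5": ["pval3"],
    "pval7": ["tmp1.3", "tmp2.5"],
    "v1.8": ["pval7"],
    "v0.9": ["tmp0.1", "v1.8"],
}


def place(deps: Dict[str, List[str]], tiles: int, capacity: int) -> Dict[str, int]:
    """Greedy placement: put each instruction where most of its operands live.

    Ties are broken toward the least-loaded tile, then the lowest tile index.
    Instructions are visited in dictionary order, which is a valid
    topological order for DEPS above.
    """
    placement: Dict[str, int] = {}
    load = [0] * tiles
    for instr, operands in deps.items():
        local = [0] * tiles
        for op in operands:
            local[placement[op]] += 1
        candidates = [t for t in range(tiles) if load[t] < capacity]
        best = max(candidates, key=lambda t: (local[t], -load[t]))
        placement[instr] = best
        load[best] += 1
    return placement


def cross_tile_edges(deps: Dict[str, List[str]],
                     placement: Dict[str, int]) -> List[Tuple[str, str]]:
    """Operand transfers that would need static-router instructions."""
    return [(op, instr)
            for instr, operands in deps.items()
            for op in operands
            if placement[op] != placement[instr]]


placement = place(DEPS, tiles=2, capacity=5)
transfers = cross_tile_edges(DEPS, placement)
```

Each entry in `transfers` is a value produced on one tile and consumed on another, which is where the compiler would have to generate static router instructions.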

Architectural Comparison
[Figure: RAW, Superscalar, Multiprocessor]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

Application Mapping on RAW
[Figure labels:]
– Four-way parallelized scalar code
– Two-way threaded Java program
– httpd
– Zzzz.. (sleep mode, power saving)
– Video data stream
– Custom data path pipeline (by compiler)
– Frame buffer and screen
Fast inter-tile ALU forwarding: 3 cycles
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997):

RAW - Performance Taylor, Michael Bedford, et al. "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society,

CHESS - A Reconfigurable Arithmetic Array for Multimedia Applications
Designed by Hewlett-Packard Laboratories in 1999.
Aims at speeding up arithmetic operations for multimedia applications and tries to improve memory density.
Principal goals of CHESS:
– Increased arithmetic computational density
– Increased memory bandwidth
– Increased capacity of internal memories
– Enhanced flexibility
– Rapid reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Architecture
4-bit ALUs
4-bit bus wiring
Switchboxes
Chessboard layout
Embedded block RAMs
Speed and hierarchical line lengths
Small configuration memories
No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
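
To show how a 4-bit granularity can still serve wider arithmetic, here is a minimal Python sketch that chains 4-bit additions through a carry. The function names `alu4_add` and `add_wide` are hypothetical, and the ripple-carry chaining is a generic assumption used for illustration, not a description of the CHESS ALU design itself.

```python
from typing import Tuple


def alu4_add(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Add two 4-bit values; return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def add_wide(a: int, b: int, nibbles: int = 4) -> Tuple[int, int]:
    """Add two (4 * nibbles)-bit values using a chain of 4-bit adds.

    Each step handles one nibble, least significant first, and passes its
    carry out to the next step.
    """
    result = 0
    carry = 0
    for i in range(nibbles):
        shift = 4 * i
        nibble_sum, carry = alu4_add((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= nibble_sum << shift
    return result, carry


sum16, carry16 = add_wide(0x1234, 0x0FCD)
```

With `nibbles=4` the chain behaves like a 16-bit adder; the returned carry reports overflow out of the top nibble.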

CHESS - Components
[Figures: ALU logic design; switchbox]
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Routing Structure
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Performance
High computational density
Efficient multiplies due to embedded ALU
Issues:
– No reported software or application results
– No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

Comparison: CHESS and MATRIX
Both use a 2D array of ALUs.
For both, instructions can be generated within the array.
Both architectures are flexible.
CHESS is 4-bit whereas MATRIX is 8-bit.
CHESS does not support run-time reconfiguration but has very fast configuration, as few bits are required.
CHESS has high computational density.
CHESS is aimed at arithmetic operations whereas MATRIX is more general purpose.

Network-On-Chip (NoC)

Project Overview
Implementing a Coarse Grained and Hybrid Reconfigurable Architecture
NoC interconnection between processing elements
Supports Variable Block Size Motion Estimation
Motion Estimation Algorithms:
– Full Search
– Diamond Search
Verma, Ruchika, and Ali Akoglu. "A coarse grained and hybrid reconfigurable architecture with flexible NoC router for variable block size motion estimation." Parallel and Distributed Processing, IPDPS IEEE International Symposium on. IEEE, 2008.
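
As a reference point for the Full Search algorithm named above, the following Python sketch performs exhaustive block matching for one block. It assumes the sum of absolute differences (SAD) as the matching cost, which is a common choice but is not stated on the slide; the names `sad` and `full_search`, the frame layout as nested lists, and the toy frames are all hypothetical. The block size `n` is a parameter, which is how a variable block size would enter this kind of search.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # frame[y][x] holds one pixel value


def sad(cur: Frame, ref: Frame, bx: int, by: int, rx: int, ry: int, n: int) -> int:
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur: Frame, ref: Frame, bx: int, by: int, n: int,
                search_range: int) -> Optional[Tuple[int, int, int]]:
    """Try every displacement in the window; return (cost, dx, dy) of the best.

    Candidates that fall outside the reference frame are skipped. On equal
    cost the first candidate in scan order is kept.
    """
    height, width = len(ref), len(ref[0])
    best: Optional[Tuple[int, int, int]] = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Toy frames: every reference pixel is unique, and the current frame is the
# reference shifted by (2, 1) with wrap-around.
ref_frame = [[y * 8 + x for x in range(8)] for y in range(8)]
cur_frame = [[ref_frame[(y + 1) % 8][(x + 2) % 8] for x in range(8)] for y in range(8)]
best_match = full_search(cur_frame, ref_frame, bx=2, by=2, n=2, search_range=2)
```

Full Search evaluates every candidate in the window, so its cost grows with the square of `search_range`; Diamond Search reduces the number of candidates by following a search pattern instead, at the risk of settling on a local minimum.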

[Figure: 4 x 4 array of processing elements CPE (1,1) to CPE (4,4) with c_d and r_d connections, PE 2(1) to PE 2(4), PE 3, Main Memory, and Memory Interface (MI). Signal widths: data_load_control (16 bits), reference_block_id (5 bits), c_d_(x,y) (32 bits), r_d_(x,y) (32 bits); further bus widths shown: 32 bits, 14 bits, 12 bits]

QUESTIONS??