CALTECH cs184c Spring2001 -- DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 12: May 15, 2001 Interfacing Heterogeneous Computational.

Slides:



Advertisements
Similar presentations
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Advertisements

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Lecture 6: Multicore Systems
CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic
Lecture 15: Reconfigurable Coprocessors October 31, 2013 ECE 636 Reconfigurable Computing Lecture 15 Reconfigurable Coprocessors.
CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit Kynan Fraser.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Reconfigurable Computing: What, Why, and Implications for Design Automation André DeHon and John Wawrzynek June 23, 1999 BRASS Project University of California.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
EECC722 - Shaaban #1 lec # 9 Fall Computing System Element Choices Specialization, Development cost/time Performance/Chip Area Programmability.
EECC722 - Shaaban #1 lec # 8 Fall What is Configurable Computing? Spatially-programmed connection of processing elements Spatially-programmed.
What is Configurable Computing?
EECC722 - Shaaban #1 lec # 7 Fall What is Configurable Computing? Spatially-programmed connection of processing elements Spatially-programmed.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
EECC722 - Shaaban #1 lec # 9 Fall Computing System Element Choices Specialization, Development cost/time Performance/Chip Area Programmability.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
CS294-6 Reconfigurable Computing Day 27 Tuesday, November 24 Integrating Processors and RC Arrays (Part 2)
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Chapter 17 Parallel Processing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
Trends toward Spatial Computing Architectures Dr. André DeHon BRASS Project University of California at Berkeley.
EECC722 - Shaaban #1 lec # 9 Fall Computing System Element Choices Specialization, Development cost/time Performance/Chip Area (Computational.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)
CS294-6 Reconfigurable Computing Day 26 Thursday, November 19 Integrating Processors and RC Arrays.
Configuration. Mirjana Stojanovic Process of loading bitstream of a design into the configuration memory. Bitstream is the transmission.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
1 Chapter 13 Embedded Systems Embedded Systems Characteristics of Embedded Operating Systems.
CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.
EECC722 - Shaaban #1 lec # 9 Fall Computing System Element Choices Specialization, Development cost/time Performance/Chip Area/Watt (Computational.
CS-334: Computer Architecture
Computer System Architectures Computer System Software
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
The Garp Architecture and C Compiler Brought to you by Liao Jirong T.J. Callahan J.R. Hauser.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Archs, VHDL 3 Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Paper Review: XiSystem - A Reconfigurable Processor and System
CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
Automated Design of Custom Architecture Tulika Mitra
SPREE RTL Generator RTL Simulator RTL CAD Flow 3. Area 4. Frequency 5. Power Correctness1. 2. Cycle count SPREE Benchmarks Verilog Results 3. Architecture.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Lecture 20: Exam 2 Review November 21, 2013 ECE 636 Reconfigurable Computing Lecture 20 Exam 2 Review.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
Computer Architecture Lecture 2 System Buses. Program Concept Hardwired systems are inflexible General purpose hardware can do different tasks, given.
EEE440 Computer Architecture
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #24 – Reconfigurable.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 18: May 19, 2003 Heterogeneous Computation Interfacing.
4 Linking the Components Linking The Components A computer is a system with data and instructions flowing between its components in response to processor.
Exploiting Parallelism
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
بسم الله الرحمن الرحيم MEMORY AND I/O.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Prefetching Techniques. 2 Reading Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
My Coordinates Office EM G.27 contact time:
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 24: May 25, 2005 Heterogeneous Computation Interfacing.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 13: May 17 22, 2001 Interfacing Heterogeneous Computational.
Computing System Element Choices
CSCI 315 Operating Systems Design
Dynamically Reconfigurable Architectures: An Overview
EE 445S Real-Time Digital Signal Processing Lab Spring 2014
Md. Mojahidul Islam Lecturer Dept. of Computer Science & Engineering
Md. Mojahidul Islam Lecturer Dept. of Computer Science & Engineering
Presentation transcript:

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 12: May 15, 2001 Interfacing Heterogeneous Computational Blocks

CALTECH cs184c Spring DeHon Previously Homogenous model of computational array –single word granularity, depth, interconnect –all post-fabrication programmable Understand tradeoffs of each

CALTECH cs184c Spring DeHon Today Heterogeneous architectures –Why? Focus in on Processor + Array hybrids –Motivation –Compute Models –Architecture –Examples

CALTECH cs184c Spring DeHon Why? Why would we be interested in heterogeneous architecture? –E.g.

CALTECH cs184c Spring DeHon Why? Applications have a mix of characteristics Already accepted –seldom can afford to build most general (unstructured) array bit-level, deep context, p=1 –=> are picking some structure to exploit May be beneficial to have portions of computations optimized for different structure conditions.

CALTECH cs184c Spring DeHon Examples Processor+FPGA Processors or FPGA add –multiplier or MAC unit –FPU –Motion Estimation coprocessor

CALTECH cs184c Spring DeHon Optimization Prospect Less capacity for composite than either pure –(A 1 +A 2 )T 12 < A 1 T 1 –(A 1 +A 2 )T 12 < A 2 T 2

CALTECH cs184c Spring DeHon Optimization Prospect Example Floating Point –Task: I integer Ops + F FP-ADDs –A proc =125M 2 –A FPU =40M 2 –I cycles / FP Ops = 60 –125(I+60F)  165(I+F) ( )/40 = I/F 183  I/F

CALTECH cs184c Spring DeHon Motivational: Other Viewpoints Replace interface glue logic IO pre/post processing Handle real-time responsiveness Provide powerful, application-specific operations –possible because of previous observation

CALTECH cs184c Spring DeHon Wide Interest PRISM (Brown) PRISC (Harvard) DPGA-coupled uP (MIT) GARP, Pleiades, … (UCB) OneChip (Toronto) REMARC (Stanford) NAPA (NSC) E5 etc. (Triscend) Chameleon Quicksilver Excalibur (Altera) Virtex+PowerPC (Xilinx)

CALTECH cs184c Spring DeHon Pragmatics Tight coupling important –numerous (anecdotal) results we got 10x speedup…but were bus limited –would have gotten 100x if removed bus bottleneck Speed Up = Tseq/(Taccel + Tdata) –e.g. Taccel = 0.01 Tseq – Tdata = 0.10 Tseq

CALTECH cs184c Spring DeHon Key Questions How do we co-architect these devices? What is the compute model for the hybrid device?

CALTECH cs184c Spring DeHon Compute Models Unaffected by array logic (interfacing) Dedicated IO Processor Instruction Augmentation –Special Instructions / Coprocessor Ops –VLIW/microcoded extension to processor –Configurable Vector unit Autonomous co/stream processor

CALTECH cs184c Spring DeHon Model: Interfacing Logic used in place of –ASIC environment customization –external FPGA/PLD devices Example –bus protocols –peripherals –sensors, actuators Case for: –Always have some system adaptation to do –Modern chips have capacity to hold processor + glue logic –reduce part count –Glue logic vary –value added must now be accommodated on chip (formerly board level)

CALTECH cs184c Spring DeHon Example: Interface/Peripherals Triscend E5

CALTECH cs184c Spring DeHon Model: IO Processor Array dedicated to servicing IO channel –sensor, lan, wan, peripheral Provides –protocol handling –stream computation compression, encrypt Looks like IO peripheral to processor Maybe processor can map in –as needed –physical space permitting Case for: –many protocols, services –only need few at a time –dedicate attention, offload processor

CALTECH cs184c Spring DeHon IO Processing Single threaded processor –cannot continuously monitor multiple data pipes (src, sink) –need some minimal, local control to handle events –for performance or real-time guarantees, may need to service event rapidly –E.g. checksum (decode) and acknowledge packet

CALTECH cs184c Spring DeHon Source: National Semiconductor NAPA 1000 Block Diagram RPC Reconfigurable Pipeline Cntr ALP Adaptive Logic Processor System Port TBT ToggleBus TM Transceiver PMA Pipeline Memory Array CR32 CompactRISC TM 32 Bit Processor BIU Bus Interface Unit CR32 Peripheral Devices External Memory Interface SMA Scratchpad Memory Array CIO Configurable I/O

CALTECH cs184c Spring DeHon Source: National Semiconductor NAPA 1000 as IO Processor SYSTEM HOST NAPA1000 ROM & DRAM Application Specific Sensors, Actuators, or other circuits System Port CIO Memory Interface

CALTECH cs184c Spring DeHon Model: Instruction Augmentation Observation: Instruction Bandwidth –Processor can only describe a small number of basic computations in a cycle I bits  2 I operations –This is a small fraction of the operations one could do even in terms of w  w  w Ops w2 2 (2w) operations

CALTECH cs184c Spring DeHon Model: Instruction Augmentation (cont.) Observation: Instruction Bandwidth –Processor could have to issue w2 (2 (2w) -I) operations just to describe some computations –An a priori selected base set of functions could be very bad for some applications

CALTECH cs184c Spring DeHon Instruction Augmentation Idea: –provide a way to augment the processor’s instruction set –with operations needed by a particular application –close semantic gap / avoid mismatch

CALTECH cs184c Spring DeHon Instruction Augmentation What’s required: –some way to fit augmented instructions into stream –execution engine for augmented instructions if programmable, has own instructions –interconnect to augmented instructions

CALTECH cs184c Spring DeHon “First” Instruction Augmentation PRISM –Processor Reconfiguration through Instruction Set Metamorphosis PRISM-I –68010 (10MHz) + XC3090 –can reconfigure FPGA in one second! –50-75 clocks for operations [Athanas+Silverman: Brown]

CALTECH cs184c Spring DeHon PRISM-1 Results Raw kernel speedups

CALTECH cs184c Spring DeHon PRISM FPGA on bus access as memory mapped peripheral explicit context management some software discipline for use …not much of an “architecture” presented to user

CALTECH cs184c Spring DeHon PRISC Takes next step –what look like if we put it on chip? –how integrate into processor ISA? [Razdan+Smith: Harvard]

CALTECH cs184c Spring DeHon PRISC Architecture: –couple into register file as “superscalar” functional unit –flow-through array (no state)

CALTECH cs184c Spring DeHon PRISC ISA Integration –add expfu instruction –11 bit address space for user defined expfu instructions –fault on pfu instruction mismatch trap code to service instruction miss –all operations occur in clock cycle –easily works with processor context switch no state + fault on mismatch pfu instr

CALTECH cs184c Spring DeHon PRISC Results All compiled working from MIPS binary <200 4LUTs ? –64x3 200MHz MIPS base Razdan/Micro27

CALTECH cs184c Spring DeHon Chimaera Start from PRISC idea –integrate as functional unit –no state –RFUOPs (like expfu) –stall processor on instruction miss, reload Add –manage multiple instructions loaded –more than 2 inputs possible [Hauck: Northwestern]

CALTECH cs184c Spring DeHon Chimaera Architecture “Live” copy of register file values feed into array Each row of array may compute from register values or intermediates (other rows) Tag on array to indicate RFUOP

CALTECH cs184c Spring DeHon Chimaera Architecture Array can compute on values as soon as placed in register file Logic is combinational When RFUOP matches –stall until result ready critical path –only from late inputs –drive result from matching row

CALTECH cs184c Spring DeHon Chimaera Timing If presented –R1, R2 –R3 –R5 –can complete in one cycle If R1 presented last –will take more than one cycle for operation

CALTECH cs184c Spring DeHon Chimaera Results Speedup Compress 1.11 Eqntott 1.8 Life 2.06 (160 hand parallelization) [Hauck/FCCM97]

CALTECH cs184c Spring DeHon Instruction Augmentation Small arrays with limited state –so far, for automatic compilation reported speedups have been small –open discover less-local recodings which extract greater benefit

CALTECH cs184c Spring DeHon Big Ideas Exploit structure –area benefit to –tasks are heterogeneous –mixed device to exploit Instruction description –potential bottleneck –custom “instructions” to exploit

CALTECH cs184c Spring DeHon Big Ideas Model –for heterogeneous composition –limits of sequential control flow