Design Methodology for Customizable Programmable Processors Berkeley – Finland Day, Oct. 18, 2002 Prof. Jarmo Takala Institute of Digital and Computer.

Presentation transcript:

Design Methodology for Customizable Programmable Processors
Berkeley – Finland Day, Oct. 18, 2002
Prof. Jarmo Takala
Institute of Digital and Computer Systems, Tampere University of Technology, Tampere, Finland

J. Takala/TUT, Berkeley – Finland Day, Oct. 18, 2002
Outline
• Motivation
• Transport Triggered Architecture (TTA)
• Design Methodology for TTAs
• Research at TUT
• Conclusions

Motivation
• Programmable processors are often used in products using digital signal processing (DSP)
  • Flexibility
  • Ease of verification
• Traditionally, DSP processor architectures have been developed based on average performance in several benchmark tasks (~100)
  • User applications often contain only a subset of the total benchmarks
• Efficiency can be improved by customizing the architecture according to the given tasks

Motivation
• DSP applications often have hard real-time constraints
  • Execution should be deterministic
  • Dynamic run-time behaviours should be avoided
  • Static scheduling lends itself to DSP
• Current design complexities call for an increase in designer productivity
  • High-level languages should be used
• DSP algorithms contain inherent parallelism
  • Instruction-level parallelism (ILP) should be maximized

What is needed?
• Application-driven design process with easy design space exploration
• Replace hardware complexity by software complexity
  • Compiler-driven process
• Use a templated architecture
  • Flexible heterogeneous function units
  • Modular scalability
  • Orthogonal, compiler-friendly

Choices for Architecture Template
[Figure: ILP architectures, showing how the steps Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, and Bind Datapaths & Execute are divided between compilation time (software) and run time (hardware) for sequential (superscalar), dependence (dataflow), independence (EPIC), and independence (VLIW) architectures.]

VLIW Gained Popularity in DSP
[Figure: VLIW CPU block diagram. Blocks: Instruction Memory, Instruction Fetch, Instruction Decode, FU-1 to FU-5, Bypassing Network, Register File, Data Memory.]

Transport Triggered Architecture
• VLIW drawbacks
  • Bypass complexity
  • Register file complexity
  • Register file design restricts FU flexibility
  • Operation encoding format restricts FU flexibility
• Reverse programming paradigm [H. Corporaal, 94]
  • data transport → operation
• Instruction set contains only a single instruction: move

From VLIW to TTA
[Figure: side-by-side block diagrams of a VLIW and a TTA. Both contain Instruction Memory, Instruction Fetch, Instruction Decode, FU-1 to FU-5, Bypassing Network, Register File, and Data Memory.]

TTA Datapath
[Figure: TTA datapath. Blocks: Instruction Memory, Instruction Unit, Immediate Unit, Integer ALU, Float ALU, Integer RF, Float RF, Boolean RF, Load/Store Unit, Data Memory; the units are attached to the transport network through sockets.]

Function Units
• Operands written to operand registers (O)
• Operation performed when last operand written to trigger register (T)
• Pipeline synchronized with control bits (C)
• Standard interface
  • FU_ready
  • Result_ready
  • Global_lock
[Figure: function unit pipeline. Labels: O, T, optional, optional shadow register, logic, R, and four control bits C.]
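The operand/trigger behaviour described on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `FunctionUnit` and its method names are hypothetical, the unit takes exactly two inputs, and the operation completes immediately (pipeline stages, control bits, and `Global_lock` are not modelled).

```python
class FunctionUnit:
    """Toy model of a TTA function unit with O, T and R registers (hypothetical API)."""

    def __init__(self, name, operation):
        self.name = name
        self.operation = operation   # assumption: a two-argument Python callable
        self.o = None                # operand register (O)
        self.r = None                # result register (R)
        self.result_ready = False    # loosely mirrors the Result_ready signal

    def write_operand(self, value):
        # Writing O only latches the value; no operation starts.
        self.o = value

    def write_trigger(self, value):
        # Writing T is what starts the operation, using the latched operand.
        if self.o is None:
            raise RuntimeError(f"{self.name}: trigger written before operand")
        self.r = self.operation(self.o, value)
        self.result_ready = True

    def read_result(self):
        if not self.result_ready:
            raise RuntimeError(f"{self.name}: no result available")
        return self.r


mul = FunctionUnit("mul", lambda a, b: a * b)
mul.write_operand(6)          # 6 -> mul.o, nothing happens yet
mul.write_trigger(7)          # 7 -> mul.t, the multiplication is performed
print(mul.read_result())      # 42
```

The point of the sketch is that the unit has no opcode input: the data transport into the trigger register is the only event that causes work to be done.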

ILP Architectures
[Figure: the same division of the steps Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, and Execute between compilation time and run time, now for sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), and independence (TTA) architectures.]

TTA Characteristics: HW
• Modular
  • Can be constructed with standard building blocks
• Very flexible and scalable
  • FU functionality can be arbitrary
  • Supports user-defined Special Function Units (SFU)
• Lower complexity
  • Reduction in the number of register ports
  • Reduced bypass complexity
  • Reduction in bypass connectivity
  • Reduced register pressure
  • Trivial decoding (implies long instructions)

TTA Characteristics: SW
• Traditional operation-triggered instruction:
    mul r1,r2,r3;
• Transport-triggered instruction:
    r1 → mul.o; r2 → mul.t; mul.r → r3;
  or
    r1 → mul.o, r2 → mul.t; mul.r → r3;
• Resembles dataflow and time-stationary coding
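As a rough illustration of the rewriting shown on this slide, the following sketch turns one operation-triggered instruction into the three data transports. The function name `to_moves` is hypothetical, the operand order (two sources, then destination) is taken from the slide's `mul` example, and `->` stands in for the arrow.

```python
def to_moves(instruction):
    """Rewrite 'op src1,src2,dst;' as three transport-triggered moves."""
    opcode, operands = instruction.strip().rstrip(";").split(None, 1)
    src1, src2, dst = [name.strip() for name in operands.split(",")]
    return [
        f"{src1} -> {opcode}.o",   # first source goes to the operand register
        f"{src2} -> {opcode}.t",   # second source goes to the trigger register
        f"{opcode}.r -> {dst}",    # result register is moved to the destination
    ]


print("; ".join(to_moves("mul r1,r2,r3;")) + ";")
# r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
```

Whether the first two moves share one instruction (the comma form on the slide) is a scheduling decision that depends on the number of free buses; the sketch does not make it.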

TTA Design Tools
• Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands
  • MOVE project led by Prof. Henk Corporaal
• Fully parametric C/C++ compiler
  • Parameters: buses, connections, function units, register files, etc.
• Design space explorer
• Processor generator

Code Generation Trajectory (MOVE Project at DUT)
[Figure: code generation flow. Blocks: Application (C/C++), Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator, Profiling Data, Architecture Description, Compiler Backend, Parallel Code, Parallel Simulator; both simulators have I/O.]

TTA Specific Optimizations
• TTA allows extra scheduling optimizations
• E.g., software bypassing
  • Bypassing can eliminate the need for RF access
  • However, more difficult to schedule!
Example:
    r1 → add.o, r2 → add.t;
    add.r → r3;
    r3 → sub.o, r4 → sub.t;
    sub.r → r5;
Translates to:
    r1 → add.o, r2 → add.t;
    add.r → sub.o, r4 → sub.t;
    sub.r → r5;
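The bypassing example above can be reproduced with a small sketch. It is not the MOVE compiler's algorithm; it assumes moves are `(source, destination)` pairs executed strictly in list order, that ports are named `fu.o`, `fu.t`, `fu.r`, that register names contain no dot, and that the caller supplies the registers still needed afterwards (`live_out`). All names are hypothetical.

```python
def software_bypass(moves, live_out=()):
    """Forward FU results directly to their consumers and drop dead write-backs."""
    moves = [tuple(m) for m in moves]
    keep = [True] * len(moves)
    for i, (src, reg) in enumerate(moves):
        if not (src.endswith(".r") and "." not in reg):
            continue                      # only 'fu.r -> register' moves are candidates
        trigger_port = src[:-2] + ".t"
        result_valid = True               # does fu.r still hold the value written to reg?
        all_forwarded = True
        redefined = False
        for j in range(i + 1, len(moves)):
            s, d = moves[j]
            if s == reg:
                if result_valid:
                    moves[j] = (src, d)   # read the FU result instead of the register
                else:
                    all_forwarded = False
            if d == trigger_port:
                result_valid = False      # FU retriggered, old result is overwritten
            if d == reg:
                redefined = True          # register overwritten, stop looking
                break
        if all_forwarded and (redefined or reg not in live_out):
            keep[i] = False               # the register write is no longer needed
    return [m for m, k in zip(moves, keep) if k]


program = [("r1", "add.o"), ("r2", "add.t"), ("add.r", "r3"),
           ("r3", "sub.o"), ("r4", "sub.t"), ("sub.r", "r5")]
for src, dst in software_bypass(program, live_out={"r5"}):
    print(f"{src} -> {dst}")
```

With `r5` as the only live register, the write to `r3` disappears and `add.r` feeds `sub.o` directly, matching the translated sequence on the slide. The harder part the slide alludes to, fitting the forwarded move into a cycle while the result is still in the FU, is outside this sketch.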

Design Space Exploration (MOVE Project at DUT)
[Figure: exploration flow in two phases, resource optimization and connectivity optimization. Blocks: Application (C/C++), Frontend, FU models, Cost Functions, Resources (Mach), Select Resources, Map & Schedule, Simulator, Design Points, Reduce Connections, Map & Schedule, Simulator, Design Point.]

Exploration: Resource Optimization (MOVE Project at DUT)
[Figure: explored design points.]
• The Pareto curve represents the lower bound of the architecture configurations found
• One architecture is selected for further optimization
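How a Pareto curve is extracted from explored design points can be sketched as follows. The two axes (cost and cycle count, lower is better on both) and the sample numbers are assumptions made up for illustration; they are not results from the MOVE project.

```python
def pareto_front(points):
    """Return the design points that no other point beats on both axes."""
    front = []
    for name, cost, cycles in points:
        dominated = any(
            c2 <= cost and t2 <= cycles and (c2 < cost or t2 < cycles)
            for _, c2, t2 in points
        )
        if not dominated:
            front.append((name, cost, cycles))
    return sorted(front, key=lambda p: p[1])   # order along the cost axis


# Hypothetical design points: (configuration, relative cost, cycle count)
explored = [("A", 10, 900), ("B", 14, 600), ("C", 15, 700),
            ("D", 22, 400), ("E", 25, 450)]
print(pareto_front(explored))
# [('A', 10, 900), ('B', 14, 600), ('D', 22, 400)]
```

Configurations C and E drop out because B and D are at least as good on both axes; the designer then picks one of the remaining points for connectivity optimization.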

Exploration: Connectivity Optimization (MOVE Project at DUT)
• Reduced connections decrease bus delay
• Critical connections have been removed
[Figure: interconnection network before and after optimization. Unit labels: IRU, ALU, IU, LSU.]

Topics to be Investigated
• Poor code density
  • Good target for code compression techniques
  • A priori information about the application is available, thus instruction probabilities are known
• Estimations
  • Power estimation
  • Fast estimations with sufficient accuracy
• Flexibility, reuse
  • Applications may change, thus additional resources need to be assigned although not needed by the original application
• Tool-assisted special function unit generation
  • Analysis support
  • Model creation support
  • Characterization support
• Parameterized processor generator
  • Interconnections, control, etc. may be realized in several ways depending on the target
• Low-power optimizations
• Clustered TTAs
• Interprocessor communication schemes
• These topics are considered in the FlexDSP Project at TUT
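To make the code compression point concrete: when instruction probabilities are known in advance, frequent instruction patterns can be given shorter codes. The slide does not name a method; the sketch below uses Huffman coding purely as one familiar example, and the move patterns and probabilities are invented for illustration.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a prefix code from symbol probabilities (illustrative choice of method)."""
    if len(probabilities) == 1:
        return {symbol: "0" for symbol in probabilities}
    tie = count()                     # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in codes0.items()}
        merged.update({s: "1" + w for s, w in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


# Hypothetical move patterns and their probabilities in one application
probs = {"r1 -> add.o": 0.5, "r2 -> add.t": 0.25,
         "add.r -> r3": 0.125, "r4 -> sub.t": 0.125}
codes = huffman_code(probs)
average_bits = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(average_bits)   # 1.75, compared with 2 bits for a fixed-length code
```

A real scheme would also have to account for the decompression hardware on the instruction fetch path, which this sketch ignores.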

New Design Environment (target of FlexDSP Project at TUT)
[Figure: design flow. Blocks: Functionality (C/C++), Frontend, Operation Analysis, SFU Generation, FU models (C, HDL), Cost Functions (area, power, speed), Resource Constraints, Design Space Exploration, Parametric Compiler, Code Compression, Parallel Object Code, Parametric Processor Generator, HDL Code, TTA Processor.]

Conclusions
• Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom
• TTA is a promising candidate for an architectural template for customized processors
  • In particular, support for custom function units allows powerful tailoring
• Results of the MOVE project at DUT have already proven the concept
  • Parameterized compiler allows tool-assisted design space exploration
• Still more research is needed on
  • Hardware implementations
  • Enhanced compiler strategies