Design Methodology for Customizable Programmable Processors
Berkeley – Finland Day, Oct. 18, 2002
Prof. Jarmo Takala
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland
Tel: +358 –
J.Takala/TUT, Berkeley – Finland Day, Oct. 18, 2002

Outline
  Motivation
  Transport Triggered Architecture (TTA)
  Design Methodology for TTAs
  Research at TUT
  Conclusions

Motivation
  Programmable processors are often used in products that rely on digital signal processing (DSP)
    Flexibility
    Ease of verification
  Traditionally, DSP processor architectures have been developed based on average performance over several benchmark tasks (~100)
  User applications often contain only a subset of the benchmarks
  Efficiency can be improved by customizing the architecture to the given tasks

Motivation (continued)
  DSP applications often have hard real-time constraints
    Execution should be deterministic
    Dynamic run-time behaviour should be avoided
    Static scheduling lends itself to DSP
  Current design complexity calls for an increase in designer productivity
    High-level languages should be used
  DSP algorithms contain inherent parallelism
    Instruction-level parallelism (ILP) should be maximized

What is needed?
  Application-driven design process with easy design space exploration
  Replace hardware complexity with software complexity
    Compiler-driven process
  Use a templated architecture
    Flexible, heterogeneous function units
    Modular scalability
    Orthogonal, compiler friendly

Choices for Architecture Template
  [Figure: ILP architectures, showing how the work is divided between compilation time (software) and run time (hardware).
   Steps: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute.
   Architecture classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW).]

VLIW Gained Popularity in DSP
  [Figure: VLIW CPU block diagram with Instruction Memory, Instruction Fetch, Instruction Decode, function units FU-1 to FU-5, Bypassing Network, Register File, and Data Memory.]

Transport Triggered Architecture
  VLIW drawbacks
    Bypass complexity
    Register file complexity
    Register file design restricts FU flexibility
    Operation encoding format restricts FU flexibility
  Reverse programming paradigm [H. Corporaal, 94]
    Data transport triggers the operation
    Instruction set contains only a single instruction: move

From VLIW to TTA
  [Figure: two block diagrams side by side, labelled VLIW and TTA. Each shows Instruction Memory, Instruction Fetch, Instruction Decode, function units FU-1 to FU-5, Bypassing Network, Register File, and Data Memory.]

TTA Datapath
  [Figure: TTA datapath with Integer ALU, Float ALU, Boolean RF, Float RF, Integer RF, Load/Store Unit, Immediate Unit, and Instruction Unit, attached through sockets; Instruction Memory and Data Memory are also shown.]

Function Units
  Operands are written to operand registers (O)
  The operation is performed when the last operand is written to the trigger register (T)
  The pipeline is synchronized with control bits (C)
  Standard interface
    FU_ready
    Result_ready
    Global_lock
  [Figure: function unit pipeline with operand register (O), optional shadow register, trigger register (T), logic, result register (R), and control bits (C).]
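
The trigger semantics of a function unit can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: the class name `FunctionUnit` and its methods are hypothetical, the unit has exactly one operand register and one trigger register, and pipeline latency, control bits, and the FU_ready / Result_ready / Global_lock signals are ignored.

```python
class FunctionUnit:
    """Toy two-input function unit (hypothetical model, zero latency)."""

    def __init__(self, operation):
        self.operation = operation  # e.g. lambda a, b: a + b
        self.o = 0                  # operand register (O)
        self.r = None               # result register (R), empty until triggered

    def write_operand(self, value):
        # Writing O only latches the value; nothing is computed yet.
        self.o = value

    def write_trigger(self, value):
        # Writing T starts the operation on the latched operand and the trigger value.
        self.r = self.operation(self.o, value)

    def read_result(self):
        return self.r


add = FunctionUnit(lambda a, b: a + b)
add.write_operand(2)           # move 2 -> add.o: no operation yet
before = add.read_result()     # still None
add.write_trigger(3)           # move 3 -> add.t: operation fires
after = add.read_result()      # 5
print(before, after)
```

The point of the model is that the operation is a side effect of the move into the trigger register, not of a separate opcode.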

ILP Architectures
  [Figure: division of work between compilation time and run time for ILP architectures, now including TTA.
   Steps: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute.
   Architecture classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), independence (TTA).]

TTA Characteristics: HW
  Modular
    Can be constructed with standard building blocks
  Very flexible and scalable
    FU functionality can be arbitrary
    Supports user-defined Special Function Units (SFU)
  Lower complexity
    Reduction in the number of register ports
    Reduced bypass complexity
    Reduction in bypass connectivity
    Reduced register pressure
    Trivial decoding (implies long instructions)

TTA Characteristics: SW
  Traditional operation-triggered instruction:
    mul r1, r2, r3;
  Transport-triggered instruction:
    r1 → mul.o; r2 → mul.t; mul.r → r3;
  or
    r1 → mul.o, r2 → mul.t; mul.r → r3;
  Resembles dataflow and time-stationary coding
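
The correspondence between the two instruction styles can be sketched as a mechanical expansion. The helper names `to_moves` and `format_moves` are hypothetical, the port suffixes `.o`, `.t`, and `.r` follow the slide's notation, and the ASCII arrow `->` stands in for the transport arrow. The sketch assumes a two-operand operation whose last argument is the destination register.

```python
def to_moves(opcode, src1, src2, dst):
    """Expand 'opcode src1, src2, dst' into three data transports."""
    return [
        (src1, f"{opcode}.o"),   # first operand to the operand register
        (src2, f"{opcode}.t"),   # second operand to the trigger register
        (f"{opcode}.r", dst),    # result register to the destination
    ]


def format_moves(moves):
    """Render one move per instruction, separated by semicolons."""
    return " ".join(f"{src} -> {dst};" for src, dst in moves)


mul_moves = to_moves("mul", "r1", "r2", "r3")
print(format_moves(mul_moves))
# r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
```

A real compiler would additionally pack independent moves into the same instruction when buses are free, which gives the second form on the slide.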

TTA Design Tools
  Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands
    MOVE project led by Prof. Henk Corporaal
  Fully parametric C/C++ compiler
    Buses, connections, function units, register files, etc.
  Design space explorer
  Processor generator

Code Generation Trajectory (MOVE Project at DUT)
  [Figure: code generation flow with the blocks Application (C/C++), Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator, Profiling Data, Architecture Description, Compiler Backend, Parallel Code, and Parallel Simulator; both simulators have I/O.]

TTA Specific Optimizations
  TTA allows extra scheduling optimizations
  E.g., software bypassing
    Bypassing can eliminate the need for RF access
    However, it is more difficult to schedule!
  Example:
    r1 → add.o, r2 → add.t;
    add.r → r3;
    r3 → sub.o, r4 → sub.t;
    sub.r → r5;
  Translates to:
    r1 → add.o, r2 → add.t;
    add.r → sub.o, r4 → sub.t;
    sub.r → r5;
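
The software bypassing transformation in this example can be sketched as a pass over a flat list of moves. This is an illustrative sketch, not the MOVE compiler's algorithm. The function names are hypothetical, and it makes several assumptions: registers are named `r0`, `r1`, and so on; a temporary may be bypassed only if it is read exactly once before being overwritten and is not in `live_out`; and timing, bus availability, and the lifetime of the FU result register are ignored.

```python
def is_register(name):
    # Hypothetical naming convention: general-purpose registers are r0, r1, ...
    return name.startswith("r") and name[1:].isdigit()


def software_bypass(moves, live_out=()):
    """moves: list of (src, dst). Forward FU results directly to their single consumer."""
    result = list(moves)
    i = 0
    while i < len(result):
        src, dst = result[i]
        if not is_register(src) and is_register(dst) and dst not in live_out:
            # Collect later reads of dst, stopping once dst is overwritten.
            readers = []
            for j in range(i + 1, len(result)):
                s, d = result[j]
                if s == dst:
                    readers.append(j)
                if d == dst:
                    break
            if len(readers) == 1:
                j = readers[0]
                result[j] = (src, result[j][1])  # read the FU result directly
                del result[i]                    # the RF write is no longer needed
                continue
        i += 1
    return result


program = [
    ("r1", "add.o"), ("r2", "add.t"),
    ("add.r", "r3"),
    ("r3", "sub.o"), ("r4", "sub.t"),
    ("sub.r", "r5"),
]
bypassed = software_bypass(program, live_out={"r5"})
print(bypassed)
```

With `r5` marked live, the write to `r3` disappears and `add.r` feeds `sub.o` directly, matching the translated code on the slide. If `r3` were also needed later, it would have to be listed in `live_out` and the RF write would stay.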

Design Space Exploration (MOVE Project at DUT)
  [Figure: exploration flow in two phases, Resource Optimization and Connectivity Optimization. Blocks: Application (C/C++), Frontend, FU models, Cost Functions, Select Resources, Resources (Mach), Map&Schedule, Simulator, Design Points, Reduce Connections, Design Point.]

Exploration: Resource Optimization (MOVE Project at DUT)
  The Pareto curve represents the lower bound of the architecture configurations found
  [Figure: explored design points with the Pareto curve; one point is marked as the architecture selected for further optimization.]
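
How a Pareto curve is extracted from explored design points can be sketched as follows. The function name and the sample numbers are hypothetical, and the sketch assumes each design point is summarized by two costs to be minimized (here called area and cycles); the slides do not state which axes the explorer uses.

```python
def pareto_front(points):
    """Return the non-dominated (area, cycles) points, sorted by area.

    A point is dominated if another point is no worse in both
    coordinates and strictly better in at least one.
    """
    front = []
    best_cycles = float("inf")
    # After sorting by area, a point is non-dominated exactly when it
    # improves on the best cycle count seen among cheaper points.
    for area, cycles in sorted(set(points)):
        if cycles < best_cycles:
            front.append((area, cycles))
            best_cycles = cycles
    return front


# Made-up design points, for illustration only.
design_points = [(10, 900), (12, 700), (15, 720), (20, 400), (25, 400), (30, 350)]
front = pareto_front(design_points)
print(front)  # [(10, 900), (12, 700), (20, 400), (30, 350)]
```

Any point on the returned front is a candidate for the connectivity optimization phase; which one to pick depends on the cost budget of the target product.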

Exploration: Connectivity Optimization (MOVE Project at DUT)
  Reduced connections decrease bus delay
  Critical connections have been removed
  [Figure: bus connectivity of the processor units; unit labels include IRU, ALU, IU, and LSU.]

Topics to be Investigated
  Poor code density
    Good target for code compression techniques
    A priori information about the application is available, thus instruction probabilities are known
  Estimations
    Power estimation
    Fast estimations with sufficient accuracy
  Flexibility, reuse
    Applications may change, thus additional resources need to be assigned although they are not needed by the original application
  Tool-assisted special function unit generation
    Analysis support
    Model creation support
    Characterization support
  Parameterized processor generator
    Interconnections, control, etc. may be realized in several ways depending on the target
  Low-power optimizations
  Clustered TTAs
  Interprocessor communication schemes
  These topics are considered in the FlexDSP Project at TUT
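
To illustrate why known instruction probabilities help code compression, the sketch below computes Huffman code lengths for a set of symbols. Huffman coding is used here only as one familiar example of an entropy code; the slides do not name a specific technique. The function name, the choice of whole moves as symbols, and the probabilities are all assumptions made for illustration.

```python
import heapq


def huffman_code_lengths(probabilities):
    """Map each symbol to its Huffman code length in bits."""
    if len(probabilities) == 1:
        return {next(iter(probabilities)): 1}
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probabilities, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


# Made-up move frequencies from a hypothetical application profile.
move_probs = {
    "r1 -> add.o": 0.5,
    "r2 -> add.t": 0.25,
    "add.r -> r3": 0.125,
    "r3 -> sub.o": 0.125,
}
lengths = huffman_code_lengths(move_probs)
average_bits = sum(move_probs[m] * lengths[m] for m in move_probs)
print(lengths, average_bits)
```

For these four symbols a fixed-width encoding needs 2 bits each, while the variable-length code averages 1.75 bits, because the most frequent move gets the shortest code.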

New Design Environment (Target of FlexDSP Project at TUT)
  [Figure: design flow with the blocks Functionality (C/C++), Frontend, Operation Analysis, SFU Generation, FU models (C, HDL), Cost Functions (area, power, speed), Resource Constraints, Design Space Exploration, Parametric Compiler, Code Compression, Parallel Object Code, Parametric Processor Generator, HDL Code, and TTA Processor.]

Conclusions
  Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom
  TTA is a promising candidate for an architectural template for customized processors
    In particular, support for custom function units allows powerful tailoring
    Results of the MOVE project at DUT have already proven the concept
    Parameterized compiler allows tool-assisted design space exploration
  Still more research is needed on
    Hardware implementations
    Enhanced compiler strategies