Synthesis of Custom Processors based on Extensible Platforms Fei Sun +, Srivaths Ravi ++, Anand Raghunathan ++ and Niraj K. Jha + + : Dept. of Electrical.


1 Synthesis of Custom Processors based on Extensible Platforms
Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+
+: Dept. of Electrical Engineering, Princeton University
++: NEC Laboratories America, Inc.

2 Outline
- SoC design constraints
- Background
  - Previous work in ASIP design
  - Xtensa platform
  - Manual custom instruction generation procedure
- Automatic custom instruction generation flow
- Experimental results
- Conclusions

3 SoC Design Constraints
- Time to market
- Cost
- Performance
- Power
- Cost-performance trade-off
- Flexibility
- ...

4 Comparison of Different Approaches
Criterion          ASIC   ASIP   GPP
Time to market     --     +      ++
Cost               ++     +      --
Performance        ++     +      --
Power              ++     +      --
Cost-performance   ++     +      --
Flexibility        --     +      ++
(++ very good, + good, -- very bad)

5 Flexibility vs. Energy Efficiency
[Figure: general embedded processors, domain specific processors (DSPs), ASIPs (Xtensa) and ASICs placed along axes of flexibility and energy efficiency; the figure also labels the AMD-K6E. Energy-efficiency ranges shown: 0.1-1 MIPS/mW, 1-10 MIPS/mW, 50-100 MIPS/mW and 500-1000 MOPS/mW.]

6 Previous Work in ASIP Design
- ASIP architectures and overall design methodologies: [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
- Application-specific instruction set selection: [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
- Low power ASIP design: [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
- Commercial offerings: Xtensa, ARCtangent, Jazz, SP-5flex, Carmel

7 Xtensa Architecture
[Figure: block diagram of the Xtensa processor (source: www.tensilica.com). Blocks include processor controls, instruction fetch and branch logic, align and decode, ALU and address generation, window register file, MAC16, designer-defined instruction execution units, coprocessor register file and execution units, instruction and data memories or caches with tags, processor interface, write buffer, timers 1 to n, interrupt control, exception support, memory protection unit, special function register access, instruction and data address watch registers 0 to n, on-chip debug, JTAG TAP control and trace port. Legend: base ISA feature, configurable function, optional function, configurable & optional function, extensible.]

8 Xtensa Processor Design Flow
[Figure (source: www.tensilica.com). Inputs: processor configuration inputs and designer-defined instruction descriptions, captured in a configuration file. Generator outputs: configured GNU C/C++ compiler, configured GNU assembler/disassembler, configured instruction set simulator/emulator and configured processor HDL. Hardware path: area, power and timing estimation; logic synthesis (Synopsys or Ambit); block place/route (Avant! or Cadence); timing verification; hardware profile. Software path: application source code and sample application data; application-specific compile, assemble, link; application simulation with ISS and/or emulator; software debugging/profiling. Results: optimized software and optimized hardware. Legend: generator output, internal database, design data, use of generated data.]

9 Manual Custom Instruction Generation Procedure
- Identify potential new instructions: profile, read source code
- Describe custom instructions: understand source code
- Insert custom instructions: rewrite source code
- Verify functional correctness
Slow and error-prone

10 Contributions of Our Work
- Automatic custom instruction selection: from an application program to an extensible processor with custom instructions
- Features:
  - Efficient design space search
  - Use of accurate information from the instruction set simulator and synthesis
  - Bridges the gap between automatically synthesized and manually designed architectures

11 Automatic Custom Instruction Generation Flow


13 Example Illustration of Template Generation


18 Key Observations for Pruning
- The higher the weight of a template, the higher its potential for improvement (Amdahl's law)
- Scope for optimization determined by computation: the number of cycles needed to execute the template
- Scope for optimization limited by read/write ports: the additional cycles needed to read/write extra input/output variables

19 Pruning Algorithm
Ranking criterion:
- OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
- In, Out: number of inputs and outputs of the template, respectively
- α, β: number of inputs/outputs encoded in the instruction
- γ: number of cycles needed to execute the template
Higher priority means greater potential for speedup.

20-23 Template Generation with Pruning
[Figure, animated over four slides: a ranked pool of seed templates with priorities 12.73, 10.51, 7.92, 5.36, 4.05 and 2.13, a template set, and a threshold of 0.1. The highest-priority template (12.73) is taken first; priorities 16.35 and 1.18 also appear. By the last slide, 1.18 is no longer shown and 16.35 is marked as the highest priority.]

24 No. of Templates vs. Threshold Ratio

25 Automatic Custom Instruction Generation Flow

26 Automatic Custom Instruction Generation Flow (Contd.)


28 Custom Instruction Insertion
- Custom instructions must be inserted at appropriate places without affecting the program's functional correctness
- If custom instructions need extra inputs (outputs), appropriate variables must be selected to write to (read from) user-defined registers

29 Example Illustration of Custom Instruction Insertion

30 Example Illustration of Custom Instruction Insertion (Contd.)
(a) Original code:
    ...
    offset = t + 1;
    for (i = 0; i < 100; i++) {
        j = ...;
        result = offset + i * j;
    }
    ...
(b) After custom instruction insertion:
    ...
    offset = t + 1;
    WUR(offset, 0);   /* write offset to user-defined register 0 before the loop */
    for (i = 0; i < 100; i++) {
        j = ...;
        result = CustomInstr(i, j);
    }
    ...

31 Automatic Custom Instruction Generation Flow

32 Custom Instruction Combination Selection --- Problem Statement
Given a set of non-overlapping custom instructions, each with several versions, select one version of each instruction such that performance is maximized while the total area stays under a given threshold.

33 Custom Instruction Combination Selection --- Flow Chart

34 Automatic Custom Instruction Generation Flow

35 Experimental Methodology
[Figure: tool flow from a C program, through automatic custom instruction generation, to a custom processor (HDL description) and a modified C program. Tools shown: Aristotle, Xtensa GNU profiler, Xtensa TIE compiler, Tensilica processor generator (from TIE), Synopsys Design Compiler with the NEC CB11 library, cross compiler, ISS and Sente WattWatcher. Metrics reported: area, clock period, execution cycles and power.]

36 Experimental Results (Contd.)
Average:
- Performance improvement: 3.4X
- Energy reduction: 3.2X
- Energy*delay reduction: 12.6X
- Area increase: 1.8%

37 Conclusions
- Automatic custom instruction synthesis for ASIPs
  - Template generation/selection
  - Custom instruction insertion
  - Custom instruction combination selection
- Experimental results
  - 3.4X average performance improvement
  - 12.6X average energy*delay reduction

