PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors. Aviral Shrivastava, Nikil Dutt, Alex Nicolau, Eugene Earlie.


1 PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors
Aviral Shrivastava (1), Nikil Dutt (1), Alex Nicolau (1), Eugene Earlie (2)
(1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA
(2) Strategic CAD Labs, Intel, Hudson, MA, USA

2 Bypassing Improves Performance
- Pipelining improves performance
  - Limited by pipeline hazards
- Bypasses eliminate certain data hazards
  - Further improve performance
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1, shown with R1 forwarded over a bypass]
Copyright © 2005 UCI ACES Laboratory. DATE, March 10, 2005.

3 Impact of Bypassing
- Area and power consumption
  - Wide multiplexers
  - Bypass control logic
  - Bypass wires
- Cycle time
  - Bypasses may be part of the timing-critical path
- Wiring congestion
- Overall chip complexity
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and multiplexers M1, M2]
Deeply pipelined out-of-order processors
- P. Ahuja et al., The Performance Impact of Incomplete Bypassing in Processor Pipelines, MICRO 1995.
- A. Abnous and N. Bagherzadeh, Pipelining and Bypassing in a VLIW Processor, IEEE Trans... 1995.

4 Problem, Solution and Problem
- Problem: How do I customize bypasses?
  - Important for embedded systems
- Solution: Keep only the most beneficial bypasses
  - Area, power and performance trade-off
- Problem: How to compile for a processor with partial bypassing?
  - Requires Compiler-in-the-Loop exploration
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF]

5 Related Work
Optimizations for partial bypassing
- P. Ahuja et al. [MICRO’95]
  - Manual code generation
- M. Buss et al. [CASES’01]
  - Optimize inter-cluster copy operations
- K. Fan et al. [ASSP’03]
  - FU-allocation strategy
  - Only for VLIW processors
- A. Shrivastava et al. [CODES’04]
  - A generic “pipeline hazard detection” mechanism to generate bypass-sensitive code
We present
- A generic Compiler-in-the-Loop bypass exploration framework
- An area-power-performance trade-off on Intel XScale, performed by varying bypasses

6 PBExplore: A CIL Exploration Framework
[Figure: PBExplore flow. Blocks: Application, Bypass Configuration, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Execution Cycles, Bypass-control Logic, Synthesis Tool, Area Estimate, Stimulus, Power Simulator, Energy Estimate, Report]
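
The flow on this slide can be read as a loop over bypass configurations. The following is a minimal sketch of such a loop, not the PBExplore implementation: the four callables (`compile_app`, `simulate`, `synthesize`, `estimate_energy`) are hypothetical stand-ins for the compiler, simulator, synthesis tool and power simulator, and the way they are wired together is an assumption based on the block names in the figure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DesignPoint:
    config: int      # encoded bypass configuration
    cycles: int      # execution cycles reported by the simulator
    area: float      # area estimate of the bypass control logic
    energy: float    # energy estimate of the bypass control logic


def explore(
    configs: Iterable[int],
    compile_app: Callable,       # hypothetical: config -> executable
    simulate: Callable,          # hypothetical: (executable, config) -> (cycles, stimulus)
    synthesize: Callable,        # hypothetical: config -> (netlist, area)
    estimate_energy: Callable,   # hypothetical: (netlist, stimulus) -> energy
) -> List[DesignPoint]:
    """Evaluate every bypass configuration with the compiler in the loop."""
    report = []
    for config in configs:
        # The application is recompiled for each configuration; this is
        # what places the compiler "in the loop".
        executable = compile_app(config)
        cycles, stimulus = simulate(executable, config)
        netlist, area = synthesize(config)
        energy = estimate_energy(netlist, stimulus)
        report.append(DesignPoint(config, cycles, area, energy))
    return report
```

Passing the tools in as callables keeps the sketch independent of any particular compiler or simulator; the returned list plays the role of the Report block.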

7 Bypass Sensitive Scheduling
- Bypasses transfer data between dependent operations
- Missing bypasses cause pipeline hazards
- A bypass-sensitive compiler should be able to detect and avoid pipeline hazards
[Figure: R1 ← R2 + R3 followed by R4 ← R4 + R1 on the pipeline F, D, OR, X1, X2, WB with register file RF; two cases labeled "No Hazard" and "Hazard"]

8 Operation Table
- An Operation Table is a binding between
  - Operation and Processor Resources and Registers
- Can detect Resource Hazards
  - OTs model processor resources
- Can detect Data Hazards
  - OTs model processor registers

Operation Table for ADD R1 R2 R3
1. F
2. D
3. OR
   ReadOperands
     R2: C1 RF
     R3: C2 RF; C5 BRF
   DestOperands
     R1: RF
4. X1
   WriteOperands
     R1: C4 BRF
5. X2
6. XWB
   WriteOperands
     R1: C3 RF

Details are in the paper.
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5]
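
To make the binding concrete, the Operation Table for ADD R1 R2 R3 can be written down as a small data structure. This is only an illustrative sketch: the class `OTStep`, its field names and the helper `bypassable_reads` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A path is (connection, register file), for example ("C5", "BRF").
Path = Tuple[str, str]


@dataclass
class OTStep:
    stage: str                                                  # pipeline stage occupied in this cycle
    reads: Dict[str, List[Path]] = field(default_factory=dict)  # register -> alternative read paths
    dests: Dict[str, str] = field(default_factory=dict)         # register -> file it will be written to
    writes: Dict[str, Path] = field(default_factory=dict)       # register -> write path


# Transcribed from the slide. Step 6 is labeled XWB in the slide text;
# "WB" here follows the stage name used in the pipeline figure.
ADD_R1_R2_R3 = [
    OTStep("F"),
    OTStep("D"),
    OTStep("OR",
           reads={"R2": [("C1", "RF")],
                  "R3": [("C2", "RF"), ("C5", "BRF")]},
           dests={"R1": "RF"}),
    OTStep("X1", writes={"R1": ("C4", "BRF")}),
    OTStep("X2"),
    OTStep("WB", writes={"R1": ("C3", "RF")}),
]


def bypassable_reads(table):
    """Registers with at least one read path that does not come from RF."""
    return sorted(
        reg
        for step in table
        for reg, paths in step.reads.items()
        if any(source != "RF" for _, source in paths)
    )
```

In this table only the second source operand (R3) has an alternative path through BRF, which is the kind of asymmetry a partially bypassed pipeline introduces.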

9 Experiments
- Experiments I: Need for a CIL framework
  - Need for Bypass-sensitive Compiler-in-the-Loop Exploration
  - Traditional exploration versus Bypass-sensitive Compiler-in-the-Loop exploration
- Experiments II: CIL Exploration
  - Use of Bypass-sensitive Compiler-in-the-Loop Exploration
  - Perform Power-Performance-Area trade-offs
  - Identify alternate interesting design points

10 Experiments I - Framework
Traditional Exploration versus Bypass-sensitive Compiler-in-the-Loop Exploration
[Figure: two flows sharing the Application and Bypass Configuration inputs]
- Traditional Exploration: gcc -O3, Executable, Cycle Accurate Simulator, Traditional Cycles
- Bypass-sensitive Compiler-in-the-Loop Exploration: OT-based Compiler, Executable, Cycle Accurate Simulator, CIL Cycles

11 Experiments I - Setup
- 7 pipeline stages can bypass a result
- We vary which pipeline stage bypasses a result
  - 2^7 = 128 bypass configurations
  - Encode bypass configuration
  - Configuration 28 = bypass paths from MWB, M2 and XWB are present
[Figure: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]
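
The encoding can be illustrated with a 7-bit mask, one bit per bypass source. The slide does not give the bit order or list the seven sources explicitly, so the list below is an assumption: it uses stage names that appear elsewhere in the slides, in an order chosen because it maps configuration 28 to MWB, M2 and XWB.

```python
# Assumed bit order, most significant bit first. This is not stated in the
# slides; it is one ordering that decodes 28 to {XWB, M2, MWB}.
BYPASS_SOURCES = ["X1", "X2", "XWB", "M2", "MWB", "D2", "DWB"]


def decode(config):
    """Return the set of stages whose bypass is present in `config`."""
    if not 0 <= config < 2 ** len(BYPASS_SOURCES):
        raise ValueError("configuration out of range")
    top = len(BYPASS_SOURCES) - 1
    return {s for i, s in enumerate(BYPASS_SOURCES) if (config >> (top - i)) & 1}


def encode(sources):
    """Inverse of decode: turn a collection of stage names into a number."""
    top = len(BYPASS_SOURCES) - 1
    return sum(1 << (top - BYPASS_SOURCES.index(s)) for s in set(sources))


print(sorted(decode(28)))   # ['M2', 'MWB', 'XWB']
```

Under this ordering, configuration 0 has no bypasses and configuration 127 has all seven.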

12 Bypass Explorations on XScale
- CIL-compiler can effectively exploit the bypass configuration
- Significant performance difference
[Figure: bitcount. Execution Cycles (850000 to 1250000) versus Bypass Source Configurations (0 to 128), Traditional and CIL]

13 X-bypass explorations in XScale
- Difference in trends
[Figure: bitcount. Execution Cycles (850000 to 1200000) versus X-bypass Configuration (combinations of X1, X2 and XWB bypasses, including none), Traditional and CIL]
[Figure: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]

14 M-bypass explorations in XScale
- Difference in trends
[Figure: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]

15 D-bypass exploration in XScale
- Difference in trends
[Figure: bitcount. Execution Cycles (860000 to 980000) versus D Bypass Configurations (-, DWB, D2, DWB D2), Traditional and CIL]
[Figure: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]

16 Experiments II - Setup
Power-Performance-Area trade-offs
- Scheduler
  - Exhaustive instruction reordering within Basic Blocks
- Synthesis Tool
  - Synopsys Design Compiler 2001.10
  - 0.8µ library lsi_10k
- Power Estimation
  - Synopsys power_estimate
[Figure: PBExplore flow. Blocks: Application, Bypass Configuration, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Bypass Control Logic, Synthesis Tool, Power Simulator, Report]
References
- Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com
- M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001
- Synopsys Design Compiler, 2001, http://www.synopsys.com/products/logic/design compiler.html

17 Performance-Energy-Area Trade-off
- Design Point 1
  - No bypass from MWB and XWB to first operand
  - 18% less area and 14% less energy consumption of bypass control logic
  - 2% performance loss
- Design Point 2
  - Only D2 and X2 bypass to first operand
  - 25% less area and 16% less energy consumption of bypass control logic
  - 6% performance loss
[Figure: trade-off plot with Point 1 and Point 2 marked]
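
One way to decide which explored configurations count as "interesting" is a Pareto filter over the three metrics. The sketch below is an illustration, not the selection procedure used in the paper. The two design points use the percentages from this slide; the zero-valued baseline they are measured against is an assumption, taken here to be the fully bypassed processor.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    area_saving: float     # % less area of bypass control logic
    energy_saving: float   # % less energy of bypass control logic
    perf_loss: float       # % performance loss


def dominates(a, b):
    """True if a is at least as good as b on every metric and better on one."""
    no_worse = (a.area_saving >= b.area_saving
                and a.energy_saving >= b.energy_saving
                and a.perf_loss <= b.perf_loss)
    better = (a.area_saving > b.area_saving
              or a.energy_saving > b.energy_saving
              or a.perf_loss < b.perf_loss)
    return no_worse and better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("baseline (assumed: full bypassing)", 0, 0, 0),
    Point("Design Point 1", 18, 14, 2),
    Point("Design Point 2", 25, 16, 6),
]
front = pareto_front(points)
```

All three points survive the filter, since each one is best on at least one metric: neither design point is strictly preferable to the other without fixing a performance budget.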

18 Summary
- Bypassing improves performance but is costly in terms of area and power
- Partial bypassing presents valuable trade-offs, but poses challenges in compilation
- We presented PBExplore, a Compiler-in-the-Loop exploration framework to explore partial bypasses
  - PBExplore uses Operation Tables to generate bypass-sensitive code
  - PBExplore automatically synthesizes bypass control logic to explore power and area trade-offs
- PBExplore is able to discover interesting design points that trade off performance for power and area of bypass control logic

19 Thank You

20 Pipeline Hazard Detection using OT
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5]

Cycle | Busy Resources: MUL R1 R2 R3 | !RF | BRF
1     | F                            | -   | -
2     | D                            | -   | -
3     | OR, C1, C2                   | -   | -
4     | X1                           | R1  | -
5     | X1, C4                       | R1  | R1
6     | X2                           | R1  | -
7     | WB, C3                       | -   | -
8     |                              | -   | -
9     |                              | -   | -
10    |                              | -   | -
11    |                              | -   | -

21 Resource Hazard Detection
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5]

Busy resources per operation; RH marks a Resource Hazard.

Cycle | MUL R1 R2 R3 | ADD R4 R2 R3 | !RF    | BRF
1     | F            |              | -      | -
2     | D            | F            | -      | -
3     | OR, C1, C2   | D            | -      | -
4     | X1           | OR, C1, C2   | R1     | -
5     | X1, C4       | RH           | R1, R4 | R1
6     | X2           | X1, C4       | R1, R4 | R4
7     | WB, C3       | X2           | R4     | -
8     |              | WB, C3       | -      | -
9     |              |              | -      | -
10    |              |              | -      | -
11    |              |              | -      | -
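
The resource columns of this table can be reproduced with a small reservation-table scheduler. This is a simplified sketch rather than the paper's algorithm: the per-step resource sets are read off the table rows, operations are assumed to issue one per cycle in program order, and a stalled operation is assumed to hold no resources while it waits.

```python
from collections import defaultdict

# Resources used in each step, read off the table above.
# MUL spends two cycles in X1, ADD spends one.
MUL = [{"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1"}, {"X1", "C4"}, {"X2"}, {"WB", "C3"}]
ADD = [{"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1", "C4"}, {"X2"}, {"WB", "C3"}]


def schedule(ops):
    """Place operations in order, stalling on resource conflicts.

    `ops` is a list of (name, steps). Returns {name: [(cycle, entry), ...]}
    where entry is the set of resources used, or "RH" for a stall cycle.
    """
    busy = defaultdict(set)          # cycle -> resources already reserved
    result = {}
    for issue_cycle, (name, steps) in enumerate(ops, start=1):
        cycle, trace = issue_cycle, []
        for needed in steps:
            while busy[cycle] & needed:      # resource hazard: wait one cycle
                trace.append((cycle, "RH"))
                cycle += 1
            busy[cycle] |= needed
            trace.append((cycle, needed))
            cycle += 1
        result[name] = trace
    return result


trace = schedule([("MUL", MUL), ("ADD", ADD)])
```

With these inputs ADD is marked RH in cycle 5, because MUL still occupies X1 and C4, and completes its write-back in cycle 8, as in the table.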

22 Data Hazard Detection
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5]

Busy resources per operation; RH marks a Resource Hazard, DH a Data Hazard.

Cycle | MUL R1 R2 R3 | ADD R4 R2 R3 | SUB R5 R4 R2 | !RF    | BRF
1     | F            |              |              | -      | -
2     | D            | F            |              | -      | -
3     | OR, C1, C2   | D            | F            | -      | -
4     | X1           | OR, C1, C2   | D            | R1     | -
5     | X1, C4       | RH           | DH           | R1, R4 | R1
6     | X2           | X1, C4       | DH           | R1, R4 | R4
7     | WB, C3       | X2           | DH           | R4     | -
8     |              | WB, C3       | OR, C1, C2   | -      | -
9     |              |              | X1, C4       | R5     | R5
10    |              |              | X2           | R5     | -
11    |              |              | WB, C3       | -      | -
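
The DH entries follow from the !RF and BRF columns. The sketch below checks, cycle by cycle, whether every source operand of SUB can be read. It is a simplified illustration, not the paper's algorithm: the register state is copied from the MUL and ADD rows of the table, and SUB is assumed to have the same read paths as the ADD operation table (first operand from RF only via C1, second operand from RF via C2 or from BRF via C5).

```python
# cycle -> (registers not yet valid in RF, registers available in BRF),
# copied from the !RF and BRF columns for cycles 4 to 8.
STATE = {
    4: ({"R1"}, set()),
    5: ({"R1", "R4"}, {"R1"}),
    6: ({"R1", "R4"}, {"R4"}),
    7: ({"R4"}, set()),
    8: (set(), set()),
}


def readable(reg, sources, cycle):
    """Can `reg` be read in `cycle` from one of the allowed sources?"""
    not_in_rf, in_brf = STATE.get(cycle, (set(), set()))
    from_rf = "RF" in sources and reg not in not_in_rf
    from_brf = "BRF" in sources and reg in in_brf
    return from_rf or from_brf


def first_read_cycle(operands, start):
    """First cycle >= start in which every operand can be read."""
    cycle = start
    while not all(readable(reg, src, cycle) for reg, src in operands):
        cycle += 1                   # data hazard: stall in this cycle
    return cycle


# Assumed read paths, mirroring the ADD operation table.
FIRST, SECOND = {"RF"}, {"RF", "BRF"}

sub_r5_r4_r2 = [("R4", FIRST), ("R2", SECOND)]
print(first_read_cycle(sub_r5_r4_r2, start=5))   # 8
```

Under the same assumptions, swapping the operands so that R4 is read as the second operand lets the read happen in cycle 6, because R4 is then available in BRF. This is the kind of difference a bypass-sensitive compiler can exploit.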

