Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010.

Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010

Outline  Motivation  Implementation Flow and Design of Experiments  Modeling Methodology  Modeling Problem  Power Modeling  Area Modeling  Experimental Results and Discussion  Conclusions

Motivation  Many-core chip  NoCs needed to interconnect many-core chips  Power-efficiency of NoCs is important  Performance was the primary concern  Now power efficiency is critical  28% of total power in Intel 80-core Teraflops chip is due to interconnection networks (routers + links);  Need rapid power estimation to trade off alternative architectures  Rapid power-area tradeoffs at the architectural level

Related Work  Real-chip power measurements (Isci et al. ’03)  RTL-level NoC power estimations (A. Banerjee et al. ’07 and N. Banerjee et al. ’04)  Simulation time is slow  Requires detailed RTL  Architectural-level power estimation  Interconnection network (Patel et al. ’97); model is not instantiated with architectural parameters  not suitable to explore tradeoffs in router microarchitecture  Uniprocessor power modeling (Wattch: Brooks et al. ’00 and SimplePower: Ye et al. ’00)  ORION models  Recently enhanced (i.e., ORION 2.0: Kahng et al. ’09)  Early-stage design space exploration

Gaps in Previous Works (1)  Dependence on underlying architecture / circuit implementation  Developed from a mix of template architectures / circuit implementations (cf. ORION 2.0)  Not accurate within an architecture-specific CAD flow  Useful for early-stage estimations (e.g., complementary to our approach)  Power and area estimations via parametric regression (Meloni et al. ‘07)  Regression process assumes certain functional forms  depends on the underlying architecture / circuit implementation  Does not consider implementation parameters (e.g., aspect ratio, etc.)  Model accuracy reduces once used across different architectures / circuit implementations  not suitable for efficient design space exploration

Gaps in Previous Works (2)  Not considering the impact of microarchitectural details  Parametric cycle-accurate traffic driven power models, without consideration of microarchitectural parameters (cf. NOCEE)  Power model with limited dependency on microarchitectural parameters; derived from synthesis results  Model applicability will significantly reduce w.r.t. energy design space explorations as the impact of contributing components are not fully realized  Our Goal: Develop a modeling framework that: (1) is architecture-independent, i.e., can be ported in any NoC component library, and (2) considers all the relevant microarchitectural details to enhance its applicability for efficient design space exploration

Improved NoC Router Power-Area Models implementation parameters Target frequency Chip aspect ratio Row utilization architectural parameters # of ports; # of buffers flit-width; # of VC voltage, frequency interconnect parameters device parameters LEF/Capacitance Tables/etc. … technology parameters req I req E req W req N req S grant I grant E grant W grant N grant S Arbiter out E out W out N out S in I in E in W in N in S out I Crossbar Buf E Buf W Buf N Buf S Buf I Link Source Link Source Write Control Request Signals  Built using router layout data  Closed-form models suitable for design space exploration  Provides significant accuracy improvement compared with existing models (e.g., ORION 2.0)

Outline Motivation  Implementation Flow and Design of Experiments  Modeling Methodology  Modeling Problem  Power Modeling  Area Modeling  Experimental Results and Discussion  Conclusions

Implementation Flow and Tools  RTL generation from architecture  Timing-driven synthesis, place and route flow  Use range of architectural and implementation parameters to capture design space  Nonparametric regression modeling Architectural Parameters Implementation Parameters Router RTL (Netmaker) Power / Area Reports Power and Area Models Model Generation (Multiple Adaptive Regression Splines) Place + Route (SOC Encounter) Synthesis (Design Compiler)

Design of Experiments  Netmaker (Cambridge)  fully synthesizable router RTL codes  Libraries: TSMC (1) 130G, (2) 90GP, and (3) 65GP  Tool Chain: Synopsys Design Compiler (DC), Cadence SOC Encounter (SOCE), Salford MARS 3.0  Experimental aces:  Technology nodes: {130nm, 90nm, 65nm}  Implementation parameters:  f clk = target clock frequency  ar = aspect ratio  util = row utilization  Architectural parameters:  fw = flit-width  n vc = number of virtual channels  n port = number of input/output ports  l buf = buffer length (#flit buffers / VC)

Outline Motivation Implementation Flow and Design of Experiments  Modeling Methodology  Modeling Problem  Power Modeling  Area Modeling  Experimental Results and Discussion  Conclusions

Modeling Problem  Accurately predict y given vector of parameters x  Difficulties: (1) which variables x to use, and (2) how different variables combine to generate y  Parametric regression: requires a functional form  Nonparametric regression: learns about the best model from the data itself  For our purpose, allows decoupling of underlying architecture / implementation from modeling effort  We use nonparametric regression to model power and area of an on-chip router → →

Multivariate Adaptive Regression Splines (MARS)  MARS is a nonparametric regression technique  MARS builds models of form:  Each basis function B i (x) can be:  a constant  a “hinge” function max(0, c-x) or max(0, x-c)  a product of two or more hinge functions  Two modeling steps:  (1) forward pass: obtains model with defined maximum number of terms  (2) backward pass: improves generality by avoiding an overfit model → → ^

Power and Area Modeling  Derive models for both dynamic and leakage power  Dynamic power is due to switching capacitance (c switching )  P dynamic = 0.5×α×c switching ×V 2 ×f clk  Leakage power is due to leakage current (i leak ) (subthreshold + gate)  P leakage = i leak ×V  Our modeling task:  To model dependence of (P dynamic / α×V 2 ×f clk ) on microarchitectural and implementation parameters  To model dependence of (P leakage / V) on microarchitectural and implementation parameters  Similarly, we model dependence of extracted area on microarchitectural and implementation parameters  Area is the sum of standard cell area

Example MARS Output Models (1) B 1 = max(0, n port - 5); B 2 = max(0, 5 – n port ); … B 34 = max(0, f clk - 200)×B 1 ; B 35 = max(0, 200 - f clk ) B 1 ; P dynamic = 0.5×α×(0.83 + 0.64×B 1 - 0.31×B 2 + 0.16×B 3 … - 0.003×B 33 + 0.003×B 34 - 0.003×B 35 )×V 2 B 1 = max(0, n port - 5); B 2 = max(0, 5 - n port ); … B 34 = max(0, n vc - 3)×B 27 ; B 35 = max(0, 3 - n vc )×B 27 ; P leakage = (0.13 + 0.04×B 1 - 0.04×B 2 + 0.01×B 3 … - 6.59E-5×B 34 - 5.53E-5×B 35 )×V Dynamic power model of a router in 65nm technology Leakage power model of a router in 65nm technology

Example MARS Output Models (2) Area model of a router in 65nm technology Total wirelength model of a router in 65nm technology (NEW) B 1 = max(0, n port - 5); B 2 = max(0, 5 - n port ); … B 34 = max(0, 24 - fw)×B 14 ; B 35 = max(0, f clk - 100)×B 15 ; Area = 0.02 + 0.01×B 1 - 0.004×B 2 + 0.003×B 3 … - 4.59E-6×B 34 - 1.23E-7×B 35 B 1 = max(0, n port - 5); B 2 = max(0, 5 - n port ); … B 33 = max(0, 1 - ar)×B 26 ; B 34 = max(0, util - 0.7)×B 8 ; WL total = 112269 + 64952.4×B 1 - 31881.3×B 2 … + 157.639×B 33 - 321.06×B 34  Closed-form expressions with respect to architectural and implementation parameters  Suitable to drive early-stage architecture-level design exploration

Outline Motivation Implementation Flow and Design of Experiments Modeling Methodology Modeling Problem Power Modeling Area Modeling  Experimental Results and Discussion  Conclusions

Model Comparison (1)  We compare our models against (1) models derived from parametric regression (Reg.), and (2) ORION 2.0 models  We assume baseline virtual channel (VC) with  FIFO buffers implemented as flip-flop registers  c switching ~ O(l buf ×fw×n port ); i leak ~ O(l buf ×fw×n port ×n vc )  Multiplexer tree crossbar  c switching ~ O(n 2 port ×fw); i leak ~ O(n 2 port ×fw)  VC “selection” arbitration (cf. Kumar et al. ’07)  c switching ~ O(n 2 port ); i leak ~ O(n 2 port ×n vc )  Buffer dynamic power does not change with n vc since the number of flits arriving at each input port is the same  Due to VC “selection” VC dynamic power becomes invariant to actual number of VCs  Requires modeler to have knowledge about the underlying architecture / circuit implementation

Model Comparison (2)  Comparison against ORION 2.0 w.r.t. microarchitectural parameters:  (1) #VC (n vc ), (2) flit-width (fw), (3) #port (n port ), and (4) buffer length (l buf )

Model Comparison (3)  Power estimation error reductions  Reg.: avg error 76.2% (24.4%  5.8%), max error 45.2% (108.4%  59.4%)  ORION 2.0: avg error 82.3% (32.8%  5.8%), max error 27.4% (81.8%  %59.4)  Area estimation error reductions  Reg.: avg error 79.4% (26.2%  %5.4), max error 45.5% (111.3%  61.8%)  ORION 2.0: avg error 83.8% (33.3%  5.4%), max error 28.3% (86.2%  61.8%) Metric Power ModelArea Model NewReg.ORION2.0NewReg.ORION2.0 min % error 130nm0.0117.6599.5260.00129.8810.121 90nm0.0087.2366.8650.00227.828.229 65nm0.0076.9217.730.00129.129.111 max % error 130nm62.0596.51103.260.72107.8104.118 90nm60.0762.3185.3560.15109.288.331 65nm59.41108.481.8161.84111.386.228 avg % error 130nm6.01223.4641.335.96126.3338.117 90nm5.65425.1130.225.01927.1132.566 65nm5.81724.4332.785.41126.2333.298

Variable Importance  Use MARS to identify relative variable importance using post- synthesis and post-layout data  n vc and n port are dominant parameters for post-synthesis and post-layout data, respectively  impact of missing layout information at post-synthesis  Multiplexer crossbar power is due to (1) multiplexers and (2) interconnection grid between input / output ports  With post-synthesis data interconnection data is ignored  crossbar power is modeled as O( n port ×log 2 n port )  With post-layout crossbar power is modeled as O( n 2 port ) Parameter Variable Importance (%) Post-SynthesisPost-Layout n port 92.98100 n vc 10095.44 l buf 88.4173.99 fw67.0364.81

Model Robustness  256 different router configurations  Five different scenarios to train / test our models  (1) s tr = 1/2, (2) s tr = 1/3, (3) s tr = 1/5, (4) s tr = 1/10, and (5) s tr = 64  For (1)-(4) we train models using a fraction s tr of the available data points, and validate them on the rest of the data points  To assess the sensitivity of the model to sample size  For (5) we use 64 (out of 256) data points to train the model, and validate it across all 256 available data points  To assess the generality of the model Metric Power Model s tr = 1/2s tr = 1/3s tr = 1/5s tr = 1/10s tr = 64 min % error0.006 0.0070.010.006 max % error12.41549.22681.11109.22477.32 avg % error1.6624.0127.99727.17721.23 Area Model min % error0.00040.0380.0420.0540.044 max % error14.10543.09978.24111.38461.24 avg % error1.8143.78.56825.16316.42

Recent Extensions  Have used same methodology to develop models for interconnect wirelength (WL) and fanout (FO)  Wirelength model  On average, within 3.4% of layout data  91% reduction of avg error vs. existing models (cf. Christie et al. ’00)  Fanout model  On average, within 0.8% of the layout data  96% reduction of avg error vs. existing models (cf. Zarkesh-Ha et al. ’00) (WL) (FO)

Outline Motivation Implementation Flow and Design of Experiments Modeling Methodology Modeling Problem Power Modeling Area Modeling Experimental Results and Discussion  Conclusions

Conclusions and Future Directions  Generally applicable modeling methodology that can leverage architectural parameters and RTL-to-layout implementation  Achieved accurate power and area models for on-chip router  Improvement over parametric regression models  Power: 76.2% (45.2%) reduction of average (maximum) error  Area: 79.4% (44.5%) reduction of average (maximum) error  Improvement over ORION 2.0  Power: 82.3% (27.4%) reduction of average (maximum) error  Area: 83.8% (28.3%) reduction of average (maximum) error  Ongoing work  Maximum f clk modeling w.r.t. architectural and implementation parameters  Other architectural building blocks (DSP cores, DesignWare library, …)  Power, performance and cost estimators for 3-D design space exploration

Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010.

Similar presentations

Presentation on theme: "Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010.

Similar presentations

Presentation on theme: "Improved On-Chip Analytical Power and Area Modeling Andrew B. Kahng Bill Lin Kambiz Samadi University of California, San Diego January 20, 2010."— Presentation transcript:

Similar presentations

About project

Feedback