Clock Clustering and IO Optimization for 3D Integration
Samyoung Bang*, Kwangsoo Han‡, Andrew B. Kahng‡† and Vaishnav Srinivas‡
‡ECE and †CSE Departments, UC San Diego, La Jolla, CA 92093
*Samsung Electronics Co. Ltd, Hwaseong-si, South Korea
eva.bang@samsung.com, {kwhan, abk, vaishnav}@ucsd.edu

2 Outline
- Motivation
- Power, Area and Timing Model
- P&R and Timing Flow
- Experimental Results
- Conclusion

3 Motivation
- For 3D integration with large bandwidth needs between dies, the choice of clocking options needs to be made upfront
- The tradeoff between area and power is needed upfront
- The choice affects floorplanning
[Figure: serializer, deserializer, 3DIO and PLL blocks on the two dies]

4 Key Choices for Clocking Options
- Local clustering: partition a given region into sub-regions
- Clock synchronization scheme
  - Synchronous
  - Source-synchronous
  - Asynchronous
- 3DIO frequency → # of 3DIO
- To enable design space pathfinding/exploration:
  - Power/area/timing model based on total bandwidth, clustering, synchronization scheme and 3DIO frequency
  - Combine clocking and 3DIO power/area/timing
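The link between 3DIO frequency and the number of 3DIOs can be illustrated with a minimal sketch. It assumes that each data 3DIO carries one bit per period of the 3DIO frequency (so a frequency in MHz is read as a per-pin rate in Mb/s) and that clock 3DIOs are not counted; the function name is hypothetical and not taken from the slides.

```python
import math

def num_data_3dio(bandwidth_gb_per_s: float, f_io_mhz: float) -> int:
    """Data 3DIOs needed for a target bandwidth.

    Assumption: one bit per 3DIO per period of the 3DIO frequency.
    """
    total_mbit_per_s = bandwidth_gb_per_s * 8 * 1000  # GB/s -> Mb/s
    return math.ceil(total_mbit_per_s / f_io_mhz)

print(num_data_3dio(12, 1280))   # 75
print(num_data_3dio(200, 4000))  # 400
```

Under this assumption, doubling the 3DIO frequency halves the number of data 3DIOs for the same bandwidth, which is why the frequency choice feeds directly into 3DIO area.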

5 3DIO Clustering
- Localize the clock tree of the 3D interconnect
- Advantages when the number of clusters increases:
  - Size of cluster clock tree ↓ (smaller skew, jitter)
  - Shorter data paths to the 3DIO array at the center of each cluster
  - Enables efficient clocking schemes (forwarded clock, asynchronous)
- Disadvantages when the number of clusters increases:
  - Overhead to synchronize between clusters on the top die
  - Overhead in cluster clock 3DIO per cluster
[Figure: layout of the bottom die, showing the clock entry point, clusters, 3DIO arrays and data paths]

6 Synchronization Schemes for 3DIO Clocking
- Synchronous
  - Cluster clock tree is balanced to all F/Fs on both the bottom and the top die
  - Simplest clocking scheme (similar to on-die)
  - Vulnerable to inter-die process/voltage variation (large skews)
- Source-synchronous
  - Forwarded clock from one die to another
  - No skew balancing needed across the two dies
  - Requires balance delays (T_b) within each die on the data path to match the clock insertion delay
- Asynchronous
  - Separate clocks on each die
  - FIFO to help clock domain crossing
  - Needs a much smaller number of 3DIOs due to the higher speeds achievable with embedded clock and CDR techniques
[Figures: synchronous clocking (launch FFs, data path and capture FFs across the bottom and top dies); source-synchronous clocking (same, with balance delay T_b); asynchronous clocking (serializer, deserializer, IO clock, cluster clock)]

7 Our Work
- Given the choices of clock synchronization scheme, number of clusters and 3DIO frequency, find the maximum bandwidth for the 3D interconnect under maximum power and area constraints.
[Figures, each plotted against the max power constraint and the max area constraint: max achievable BW; optimal clocking scheme for max BW (synch., source-synch., asynch.); optimal clocking frequency for max BW; optimal number of clusters for max BW]
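The optimization stated on this slide can be sketched as an exhaustive search over the design choices. This is only an illustration: `max_bandwidth`, the `model` callback signature and `toy_model` are hypothetical, and the numbers inside `toy_model` are placeholders, not the fitted model from the slides.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Config = Tuple[str, int, float, float]  # (scheme, n_clusters, f_io_mhz, bw_gb_s)

def max_bandwidth(
    schemes: Sequence[str],
    cluster_counts: Sequence[int],
    frequencies_mhz: Sequence[float],
    bandwidths_gb_s: Sequence[float],
    model: Callable[[str, int, float, float], Optional[Tuple[float, float]]],
    max_power: float,
    max_area: float,
) -> Optional[Config]:
    """Return the feasible configuration with the highest bandwidth, or None."""
    best: Optional[Config] = None
    for cfg in product(schemes, cluster_counts, frequencies_mhz, bandwidths_gb_s):
        estimate = model(*cfg)
        if estimate is None:  # model reports the point as infeasible (e.g. timing)
            continue
        power, area = estimate
        if power <= max_power and area <= max_area:
            if best is None or cfg[3] > best[3]:
                best = cfg
    return best

def toy_model(scheme, n_clusters, f_io_mhz, bw_gb_s):
    # Placeholder estimates only; NOT the model described in the slides.
    power = 0.01 * bw_gb_s + 0.001 * n_clusters
    area = 0.02 * bw_gb_s / (f_io_mhz / 1000.0)
    return power, area

best = max_bandwidth(["synchronous"], [1, 4], [1000.0, 2000.0],
                     [10.0, 50.0, 100.0], toy_model,
                     max_power=0.6, max_area=1.5)
print(best)  # ('synchronous', 1, 1000.0, 50.0)
```

Sweeping `max_power` and `max_area` with such a search is one way to produce maps like the ones listed in the figure placeholders.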

9 3DIO/CTS Directed Graph
- Primary inputs are indicated by circles
- Rectangles are determined by the primary inputs
- Solid and dotted arrows indicate positive and negative correlation, respectively
- Estimate the rounded rectangles as analytic expressions
[Figure: directed graph. Node labels: #clusters, freq., clocking scheme, region area, BW, # FFs, IO freq., per-IO power/area, # 3DIO, max skew/transition, jitter, clock WL, clock buf. area, data WL, data buf. area, skew outcome and clock ins. delay, WNS, area, power. Legend: input, deterministic, est. outcome, estimated; increase, decrease]

10 Clock Wirelength
- Hierarchical approach to estimate clock wirelength
- Assume the clock tree is well balanced because FFs are uniformly distributed over the region area
- Length of the Steiner minimal tree over N points uniformly distributed within a given region A_reg is proportional to [expression not preserved in this transcript]
- Total clock wirelength is [equation not preserved in this transcript]
- Notation (symbols other than i not preserved): depth of clock tree (i = 0 for the clock source), number of clusters, total number of flip-flops, fitted coefficients
[Figure: global clock tree and cluster clock tree, with levels i = 0, 1, 2 labeled w_0, w_1, w_2 down to the FFs]
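A hierarchical wirelength estimate of this kind can be sketched as follows. The sketch assumes the standard scaling result that the Steiner tree length over N uniformly distributed points in an area A grows as sqrt(N * A); the transcript does not preserve the slide's own expression, so this form, the two-level split (one global tree plus one tree per cluster) and the coefficient names `k_global` and `k_cluster` are assumptions for illustration.

```python
import math

def steiner_length(n_points: float, area: float, k: float = 1.0) -> float:
    """Assumed scaling: tree length ~ k * sqrt(N * A) for uniform points."""
    return k * math.sqrt(n_points * area)

def clock_wirelength(n_ff: int, n_clusters: int, region_area: float,
                     k_global: float = 1.0, k_cluster: float = 1.0) -> float:
    """Two-level estimate: global tree to cluster roots plus one tree per cluster."""
    # Global tree spans the cluster entry points over the whole region.
    global_wl = steiner_length(n_clusters, region_area, k_global)
    # Each cluster tree spans its share of the FFs over its share of the area.
    per_cluster_wl = steiner_length(n_ff / n_clusters,
                                    region_area / n_clusters, k_cluster)
    return global_wl + n_clusters * per_cluster_wl

print(clock_wirelength(n_ff=10000, n_clusters=4, region_area=100.0))  # 1020.0
```

In a fitted model, `k_global` and `k_cluster` would be calibrated against implemented clock trees rather than left at 1.0.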

11 Clock Buffer Area
- Tellez and Sarrafzadeh propose a method to insert the minimum number of buffers under a given transition time (T_max_tran) constraint
- Linearize the problem by using the concept of maximum capacitance (C_max)
  → Any buffer stage i with stage cap C_i < C_max will have T_i_tran < T_max_tran
- Using C_max, we estimate the number of clock buffers (N_cbuf) [equation not preserved in this transcript]
- Kashyap et al. discuss transition time degradation, and C_max can be expressed as follows [equation not preserved in this transcript]
  → Total clock buffer area is [equation not preserved in this transcript]
[Figure: buffer driving a wire of max length W_max, with input transition T_0 and output transition T_max_tran]
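One simple reading of the C_max idea is that every buffer stage may drive at most C_max of load, so the buffer count is roughly the total clock load divided by C_max. The sketch below follows that reading only; the slide's own expressions for N_cbuf and C_max are not preserved, and the function names, units and example values are hypothetical.

```python
import math

def num_clock_buffers(wire_cap_ff: float, pin_cap_ff: float, c_max_ff: float) -> int:
    """Assumed estimate: total clock load / C_max, rounded up."""
    return math.ceil((wire_cap_ff + pin_cap_ff) / c_max_ff)

def clock_buffer_area(n_buffers: int, area_per_buffer_um2: float) -> float:
    """Total buffer area, assuming a single buffer size is used throughout."""
    return n_buffers * area_per_buffer_um2

n_cbuf = num_clock_buffers(wire_cap_ff=5000.0, pin_cap_ff=1000.0, c_max_ff=100.0)
print(n_cbuf)                          # 60
print(clock_buffer_area(n_cbuf, 2.0))  # 120.0
```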

12 Data Wirelength and Data Buffer Area
- Data path wirelength is proportional to the number of data wires and the cluster dimension
  - A distribution exists based on sink placement with respect to the 3DIO cluster
- For data buffer area, we use a similar concept to the clock buffer area estimation
  - Need to consider each data path separately → cannot use total wirelength
  - Need the minimum number of data buffers to meet hold timing

13 3DIO/Overall Power and Area
- 3DIO power and area models are based on CACTI-IO
- Overall (3DIO + clocking) power and area are [equations not preserved in this transcript; the power terms are labeled switching power, internal and leakage power, and IO power]

15 3D P&R Flow - Synchronous
- Synthesize the cluster clock tree on the top die first to balance the clock tree on both dies
- Extract the maximum clock insertion delay (T_clk1)
- Propagate the data path delay (T_data) for the routing on the top die
[Figure: synchronous scheme. Layouts of the top die and the bottom die, showing T_clk1, T_data, the propagated clock and the 3DIO]

16 3D P&R Flow – Source-synchronous and Asynchronous
- Source-synchronous
  - Synthesize the clock tree and route on the bottom die, and separately synthesize the clock tree only for the top die
  - Extract the balance delay T_b (i.e., T_clk1) for each capture FF and annotate the delays to the corresponding data 3DIOs
- Asynchronous
  - Run the traditional 2D flow on both dies separately
[Figure: source-synchronous scheme. Layouts of the top die and the bottom die, showing T_clk1, the propagated clock, the 3DIO and the annotated T_b (i.e., T_clk1)]

17 Conventional 2D STA vs. our 3D STA
- We focus on inter-die variation and do not consider intra-die variation, which can be comprehended by timing derate or OCV
- Two process corners {BC, WC} for inter-die variation
- Assign the same corner to all paths on the same die
- Report the worst setup slack (WNS) out of the four combinations of corners (BC-BC, BC-WC, WC-BC, WC-WC)
- Conventional 2D STA (without inter-die variation):
  Setup slack = T_per - T_su - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,BC} - T_{launch,WC})
- Our 3D STA:
  slack1 = T_per - T_su - T_{c2q,BC} - T_{data1,BC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,BC})
  slack2 = T_per - T_su - T_{c2q,BC} - T_{data1,BC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,BC})
  slack3 = T_per - T_su - T_{c2q,WC} - T_{data1,WC} - T_{data2,BC} + (T_{capture,BC} - T_{launch,WC})
  slack4 = T_per - T_su - T_{c2q,WC} - T_{data1,WC} - T_{data2,WC} + (T_{capture,WC} - T_{launch,WC})
  Setup slack = min(slack1, slack2, slack3, slack4)
[Figure: launch FF and buffers on the bottom die (T_launch, T_c2q, T_data1) and capture FF and buffers on the top die (T_data2, T_capture)]
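The four-corner setup check can be written out as a short sketch. It follows the slack expressions on this slide, with the launch clock, clock-to-Q and data1 delays taken at the bottom-die corner and the data2 and capture clock delays taken at the top-die corner. The function names, dictionary layout and delay values are hypothetical.

```python
from itertools import product

def setup_slack(t_per, t_su, bottom, top):
    """Setup slack for one (bottom corner, top corner) pair."""
    return (t_per - t_su - bottom["c2q"] - bottom["data1"] - top["data2"]
            + (top["capture"] - bottom["launch"]))

def worst_setup_slack(t_per, t_su, bottom_corners, top_corners):
    """Evaluate BC-BC, BC-WC, WC-BC, WC-WC and return the worst pair and slack."""
    slacks = {
        (b, t): setup_slack(t_per, t_su, bottom_corners[b], top_corners[t])
        for b, t in product(("BC", "WC"), repeat=2)
    }
    worst = min(slacks, key=slacks.get)
    return worst, slacks[worst]

# Hypothetical delays in ns, for illustration only.
bottom = {"BC": {"launch": 0.3, "c2q": 0.1, "data1": 0.1},
          "WC": {"launch": 0.6, "c2q": 0.2, "data1": 0.2}}
top = {"BC": {"data2": 0.1, "capture": 0.3},
       "WC": {"data2": 0.2, "capture": 0.6}}

corner, slack = worst_setup_slack(1.0, 0.05, bottom, top)
print(corner, round(slack, 3))  # ('WC', 'BC') 0.15
```

With these example numbers the worst case is a slow bottom die paired with a fast top die: the launch side is late while the capture clock arrives early.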

19 Experimental Setup
- P&R tool is Synopsys IC Compiler I-2013.12-SP1
- Timing analysis tool is Synopsys PrimeTime H-2013.06-SP2
- We use a 65nm TSMC library
- Design of experiments
  - Bandwidth (10 – 200 GB/s)
  - Region area (25 – 100 mm^2)
  - 3DIO clock frequency
    - Synchronous (100 – 2000 MHz)
    - Source-synchronous (1500 – 4000 MHz)
    - Asynchronous (3500 – 8000 MHz)
  - Number of clusters (1 – 25)
- We select four data points for each parameter → 256 design implementations for each clocking scheme
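The count of 256 follows from four parameters with four sample points each (4^4). The sketch below enumerates such a full-factorial design for the synchronous scheme; the slide gives only the ranges, so the evenly spaced points and the cluster counts chosen here are assumptions.

```python
from itertools import product

def four_points(lo: float, hi: float) -> list:
    """Four evenly spaced sample points in [lo, hi] (the spacing is an assumption)."""
    step = (hi - lo) / 3
    return [lo + i * step for i in range(4)]

bandwidths_gb_s = four_points(10, 200)
region_areas_mm2 = four_points(25, 100)
sync_freqs_mhz = four_points(100, 2000)  # synchronous range from the slide
cluster_counts = [1, 9, 17, 25]          # hypothetical integer choices in 1..25

runs = list(product(bandwidths_gb_s, region_areas_mm2,
                    sync_freqs_mhz, cluster_counts))
print(len(runs))  # 256
```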

20 Model Fitting Approach
- We use an Artificial Neural Network (ANN) model for our fit, guided by the directed graph
- Iteratively progress through the directed graph to fit each node
  - Clock wirelength
  - Data wirelength
  - Clock buffer area
  - Data buffer area
  - 3DIO power/area
  - Total area/power/WNS
- We use F_max for the timing model (instead of WNS)
- Multiple runs with different training, validation and test data sets → improved generality and robustness of the resulting models

21 Area, Power and Timing Model Results
- Min-max error within +/-20%
  - For the synchronous scheme, the tool inserts a large number of hold buffers due to inter-die variation → larger error
- Mean error within +/-0.5%
[Figures: model error for area, power and F_max]
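Error figures of this kind can be computed as in the sketch below. It assumes a signed relative error, (predicted - actual) / actual, expressed in percent; the slide does not state the exact definition, and the function names and sample values are hypothetical.

```python
def percent_errors(predicted, actual):
    """Signed relative error in percent for each (predicted, actual) pair."""
    return [100.0 * (p - a) / a for p, a in zip(predicted, actual)]

def summarize(errors):
    """Return (min, max, mean) of a list of percent errors."""
    return min(errors), max(errors), sum(errors) / len(errors)

errors = percent_errors([110.0, 90.0, 100.0], [100.0, 100.0, 100.0])
print(summarize(errors))  # (-10.0, 10.0, 0.0)
```

A mean close to zero alongside a wider min-max range, as reported on the slide, indicates that over- and under-estimates largely cancel across the data set.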

22 Design Space Results
- Max BW:
  - Figure shows the iso-bandwidth curves
  - Vertical and horizontal walls show the min power/area required to hit a bandwidth requirement
- Clocking scheme:
  - The asynchronous scheme is area-efficient
  - The synchronous scheme is power-efficient
  - The source-synchronous scheme provides a valuable tradeoff between power and area along the knee of the iso-bandwidth curve
  - The interesting tradeoffs between the schemes occur along these knee points as we change the power/area constraints
[Figures: max BW; optimal clocking scheme for max BW]

23 Design Space Results
- Cluster clock frequency:
  - As the power constraint gets tighter, frequency goes down
  - As the area constraint gets tighter, frequency goes up
  - Source-synchronous schemes provide benefits at higher cluster frequencies
  - The asynchronous scheme provides a way to keep the cluster frequency down but still have a high 3DIO frequency, through serialization
- Number of clusters:
  - Not monotonic along edges of the hypercube and clocking scheme boundaries
  - Also sensitive to the total region area
[Figures: optimal # of clusters for max BW; optimal cluster frequency for max BW]

25 Conclusion
- We have developed a power, area and timing model for 3DIO and CTS that includes clustering and three different clock synchronization schemes (synchronous, source-synchronous, asynchronous)
- Our model estimates power, area and timing within 20% error across a large range of bandwidths, region areas, numbers of clusters and 3DIO frequencies
- Our modeling methodology will enable architects to study and optimize the design space upfront
- Key takeaways:
  - Iso-bandwidth lines identify the min area/power required to hit a particular BW
  - Clocking scheme tradeoffs are interesting along the knee of iso-bandwidth lines
  - Cluster frequency for asynchronous schemes can be kept low while still reducing the number of 3DIO due to serialization

26 Future Work
- Extend our model to be aware of:
  - Placement uniformity
  - Technology dependence
  - Datapath logic
  - More comprehensive STA including intra-die variation
  - Blockages
  - Asymmetric clustering
  - Different 3DIO placement
  - Serial 3DIO circuit options for the asynchronous scheme
  - 2.5D (interposer-based) design

Thank you

BACKUP

29 Synchronous
- All end points on both dies are synchronized
  - Colored FFs are uniformly distributed over the region
  - Non-colored FFs are placed right next to the 3DIO array
- Clock tree is vulnerable to inter-die variation
- Use DDR to minimize the number of 3DIOs
- Two factors determine the max 3DIO clock frequency (F_IO):
  - Clock skew due to inter-die variation
  - Jitter
- Increasing #clusters → increases max F_IO, because the clock tree becomes more robust to inter-die variation

BW (GB/s)  Region Area (mm^2)  N_cluster  F_max (MHz)  F_IO (MHz)
12         25                  1          640          1280
11.25      25                  25         900          1800
12         100                 1          300          600
11.25      100                 25         600          1200
50.025     25                  1          460          920
50.625     25                  25         900          1800
200.25     100                 1          300          600
202.5      100                 25         600          1200

30 Source-Synchronous
- Forward the clock from one die to the other
- For any path across dies, the launch and capture path delays from the source to the 3DIO at the bottom die are balanced → no inter-die variation
- Requires balance delay T_b to compensate the clock insertion delay T_clk1
- Two factors determine the max 3DIO clock frequency (F_IO):
  - Skew between T_b and T_clk1 due to intra-die variation
  - Jitter

BW (GB/s)  Region Area (mm^2)  N_cluster  F_max (MHz)  F_IO (MHz)
12.095     25                  1          820          1640
10.625     25                  25         1700         3400
50.02      25                  1          820          1640
46.875     25                  25         1500         3000
12         100                 1          500          1000
15         100                 25         1200         2400
200        100                 1          350          700
195        100                 25         1200         2400

31 Asynchronous
- Use a FIFO (1:8 serializer, 8:1 deserializer) to separate the clock domains
  - No inter-die variation
  - Minimizes the number of 3DIOs
- Requires a PLL for the cluster clock on the top die and for the IO clock on both dies
  - Large power overhead
- One factor determines the max 3DIO clock frequency (F_IO):
  - Jitter

BW (GB/s)  Region Area (mm^2)  N_cluster  F_max (MHz)  F_IO (MHz)
11.9       25                  1          700          5600
25         25                  25         1000         8000
49.7       25                  1          700          5600
40         25                  25         800          6400
12         100                 1          400          3200
20         100                 25         800          6400
200        100                 1          125          1000
200        100                 25         500          4000

32 Flow of Synch. Clocking Schemes
- Example: BW = 12 GB/s, A_reg = 81 mm^2, n_c = 4, f_clus = 1000 MHz
- Flow steps shown: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
- Delay to balance the clock insertion delays across dies:
  0.307 - 0.125 = 0.182ns (bc)
  0.618 - 0.247 = 0.371ns (wc)
- Input delay to prevent unnecessary hold buffer insertions:
  0.307 + 0.089 - 0.182 = 0.214ns
[Figure: bottom and top dies with cluster buffers, annotated with delays 0.307ns (bc) / 0.618ns (wc), 0.089ns (bc) / 0.200ns (wc), 0.125ns (bc) / 0.247ns (wc) and 0.147ns (bc) / 0.306ns (wc)]
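The arithmetic on this slide can be reproduced with a few lines. The variable names are a reading of the figure annotations, not labels from the slide: the first pair of values is treated as the larger and smaller clock insertion delays being balanced, and 0.089ns as an additional delay on the launch side.

```python
def balance_delay(larger_insertion_ns: float, smaller_insertion_ns: float) -> float:
    """Extra delay needed so that the two clock insertion delays match."""
    return larger_insertion_ns - smaller_insertion_ns

balance_bc = balance_delay(0.307, 0.125)  # best corner
balance_wc = balance_delay(0.618, 0.247)  # worst corner

# Input delay used to avoid unnecessary hold buffer insertion (slide formula).
input_delay = 0.307 + 0.089 - balance_bc

print(round(balance_bc, 3), round(balance_wc, 3), round(input_delay, 3))
# 0.182 0.371 0.214
```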

33 Flow of Synch. Clocking Schemes
- Flow steps shown: CTS on top die; extract balance delay; CTO and route on top die; STA
- Run CTO and route at the worst corner, considering hold time and clock uncertainty
- Setup: 0.5 (half cycle) + 0.802 (t_clk) - 0.075 (t_unc) - 0.008 (t_s) - 1.195 (t_data) = 0.024ns
- Hold: 0.683 (t_data) - 0.576 (t_clk) - 0.060 (t_unc) - 0.030 (t_h) = 0.017ns
[Figure: bottom and top dies with cluster buffers. Top-die annotations: 0.247ns (wc), 0.214ns. Setup view: 0.618ns (wc), 0.200ns (wc), 0.125ns (bc), 0.306ns (wc). Hold view: 0.307ns (bc), 0.089ns (bc), 0.247ns (wc), 0.147ns (bc). Other annotations: 0.371ns (wc), 0.071ns (bc), 0.140ns (wc), 0.182ns (bc)]

34 Flow of Source-synch. Clocking Schemes
- Example: BW = 12 GB/s, A_reg = 81 mm^2, n_c = 4, f_clus = 1000 MHz
- Flow steps shown: custom placement on bottom/top dies; CTS, CTO and route on bottom die; CTS on top die; extract delay
- Balance the delay from the clock source to the data 3DIO and the delay from the clock source to the clock 3DIO:
  0.618 + 0.200 = 0.818ns
- Annotate the "balancing delay": 0.247ns
[Figure: bottom and top dies with cluster buffers, annotated with delays 0.618ns, 0.200ns, 0.247ns and 0.306ns]

35 Flow of Source-synch. Clocking Schemes
- Flow steps shown: CTS on top die; extract balance delay; CTO and route on top die; STA
- Run CTO and route at the worst corner, considering hold time and clock uncertainty
- Setup: 0.5 (half cycle) + 1.371 (t_clk) - 0.075 (t_unc) - 0.008 (t_s) - 1.471 (t_data) = 0.317ns
- Hold: 1.471 (t_data) - 1.371 (t_clk) - 0.060 (t_unc) - 0.030 (t_h) = 0.010ns
[Figure: bottom and top dies with cluster buffers and the balancing delay. Top-die annotations: 0.247ns (wc), 0.247ns. Other annotations: 0.618ns, 0.200ns, 0.247ns, 0.306ns, 0.818ns, 0.100ns]
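The setup and hold checks quoted on the flow slides share one form, which the sketch below evaluates for both the synchronous and the source-synchronous examples. The function and parameter names are hypothetical; the numbers are the ones printed on the slides.

```python
def setup_slack(capture_edge, t_clk, t_unc, t_s, t_data):
    """Setup slack: capture edge plus capture clock delay, minus margins and data arrival."""
    return capture_edge + t_clk - t_unc - t_s - t_data

def hold_slack(t_data, t_clk, t_unc, t_h):
    """Hold slack: data arrival minus capture clock delay and margins."""
    return t_data - t_clk - t_unc - t_h

# Synchronous example (ns)
print(round(setup_slack(0.5, 0.802, 0.075, 0.008, 1.195), 3))  # 0.024
print(round(hold_slack(0.683, 0.576, 0.060, 0.030), 3))        # 0.017

# Source-synchronous example (ns)
print(round(setup_slack(0.5, 1.371, 0.075, 0.008, 1.471), 3))  # 0.317
print(round(hold_slack(1.471, 1.371, 0.060, 0.030), 3))        # 0.01
```

Both examples use a 0.5ns capture edge, which matches a half-cycle transfer at the 1000 MHz cluster frequency given for the example.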
