Presentation on theme: "Methodology for High-Speed Clock Tree Implementation in Large Chips"— Presentation transcript:
1 Methodology for High-Speed Clock Tree Implementation in Large Chips. Ravinder Rachala, Aaron Grenat, Prashanth Vallur, Christopher Ang. January 31, 2013
2 Advantages of Custom Clock Distribution. Low skew. Smaller AOCV timing uncertainty compared to full CTS. Custom buffers are more tolerant of OCV, IR drop, and supply noise. The plot here displays a scenario where increased skew would require boosting voltage to achieve the target Fmax. Effectively, skew translates to higher power (dynamic and leakage) for meeting a target frequency. [Plot labels: Low Skew, High Skew]
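The skew-to-power argument on this slide can be illustrated with a rough sketch. It relies on a simplified model that is not part of the presentation: logic delay is assumed to scale as 1/(V - Vt), dynamic power as V² at fixed frequency, and every number (nominal voltage, threshold voltage, cycle time, skew values) is a made-up placeholder.

```python
def required_voltage(cycle_ps, skew_ps, v_nom=1.0, v_t=0.3):
    """Voltage needed so that logic which just fits a full cycle at v_nom
    still fits in (cycle - skew). ASSUMPTION: delay ~ 1 / (V - Vt)."""
    usable_ps = cycle_ps - skew_ps
    if usable_ps <= 0:
        raise ValueError("skew must be smaller than the cycle time")
    # delay(V) = cycle * (v_nom - v_t) / (V - v_t) must equal usable_ps
    return v_t + (v_nom - v_t) * cycle_ps / usable_ps


def dynamic_power_ratio(v, v_nom=1.0):
    """Dynamic power relative to nominal at fixed frequency (P ~ C * V^2 * f)."""
    return (v / v_nom) ** 2


cycle_ps = 250.0  # hypothetical target cycle time
for skew_ps in (5.0, 25.0):  # hypothetical low-skew and high-skew cases
    v = required_voltage(cycle_ps, skew_ps)
    print(f"skew={skew_ps} ps -> V={v:.3f} V, "
          f"dynamic power x{dynamic_power_ratio(v):.3f}")
```

Under these assumptions, the high-skew case needs a higher supply voltage than the low-skew case for the same target frequency, which is the trend the plot describes; the absolute values carry no meaning.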
3 OLD METHODOLOGY – CLOCK SPINE FRIENDLY FPLAN. [Figure labels: PLL, Clock Spine Macros] Shown here is a typical CPU floorplan: a regular and very constrained problem. The clock trees do not cut into too many blocks where blockages from clock buffers would cause congestion. The same macro can be programmed with varying final buffer strengths because the aspect ratio is the same. A regular, repetitive structure like this floorplan is conducive to thin, long clock macro structures. Here we build 2 unique types of clock macros and stamp them, so the custom macro effort is relatively small compared to more complex floorplans.
4 OLD FLOW - Clock Spine Topology in complex floorplan. In more complex floorplans like the one shown, we would end up needing too many custom clock spine macros, which are resource intensive and hard to converge in time for chip tapeout. The traditional clock spine macro style is not scalable for today’s complex chips.
5 ISSUES with OLD methodology. Very resource intensive; the increasing number of SOCs in the roadmap makes this even more challenging. Area taken by the clock trees is badly utilized (<10%). The increasing size of the macros (of the order of ~20 mm) runs the risk of not converging through the custom macro/IP build flow. Floorplan challenges in accommodating the clock macros and minimizing the number of unique macros typically consume a lot of resource energy and time. Re-use of clock macros across projects is heavily restricted by even small floorplan changes between projects.
6 TMAC Flow: New Methodology. Clock macros are broken down into cells (called TMACs: Tiny-MACros) that are instantiated flat at the IP level. Connection between the TMACs is done in overlay (or RDL - Route Distribution Layers). [Figure labels: TMAC cells, Clock Macro, 1 mm]
8 PRIOR work: example. [Figure labels: PLL, Tile/RLM, IP floorplan, Bad skew] Unique clock macros: Conduit - 1, Vtree - 1, Htree - 8 (all 8 flavors are delay-matched); total unique clock macros = 10. Driving large areas of the design from a corner (i.e., huge cap on the buffer, big current through the wire) causes EM and self-heating issues. Long distribution wires are susceptible to ringing/reflections (parasitic inductance).
9 CLOCK COVERAGE IS BETTER IN New Methodology. [Figure labels: PLL, TMAC, Tile/RLM, IP floorplan, Overlay] 1 clock spine. All TMAC cells connected in overlay. More clock coverage.
10 TMAC Flow: New Methodology BENEFITS. The entire distribution is contained in one clock spine. Reduces the number of circuit and layout resources. Frees up area between the TMACs for RLMs/tiles. The TMAC library of cells is built once per technology node (e.g. GF 28nm) and reused across all projects in that process technology. Floorplan changes can be easily accommodated even in late stages of the design cycle. Provides more complete and robust clock coverage: bad skew zones are avoided and reliability concerns are minimized. Instance swapping (sizing clock mesh drivers for power and performance optimization) can be done easily based on the clock mesh load. Creates a full-custom quality clock spine network with significantly less effort.
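Instance swapping can be pictured as a lookup against the TMAC library: for each mesh driver location, choose a cell whose drive capability covers the local mesh load. The sketch below is illustrative only. The cell names, the load limits, and the "smallest cell that fits" rule are all assumptions, not the actual library or the selection criteria used in the flow.

```python
# HYPOTHETICAL driver variants: (cell name, max load in fF), smallest first.
DRIVER_LIBRARY = [
    ("TMAC_DRV_X4", 200.0),
    ("TMAC_DRV_X8", 400.0),
    ("TMAC_DRV_X16", 800.0),
]


def pick_driver(load_ff, library=DRIVER_LIBRARY):
    """Return the smallest driver whose max load covers load_ff."""
    for name, max_load_ff in library:
        if load_ff <= max_load_ff:
            return name
    raise ValueError(f"no driver in the library can drive {load_ff} fF")


def swap_instances(mesh_loads_ff):
    """Map each driver instance to a cell chosen from its local mesh load."""
    return {inst: pick_driver(load) for inst, load in mesh_loads_ff.items()}


loads = {"drv_0": 150.0, "drv_1": 390.0, "drv_2": 650.0}  # made-up loads
print(swap_instances(loads))
```

Because the cells are flat instances rather than parts of a hardened macro, a change in mesh load only changes which library cell is referenced at that location.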
11 GRID CAP OPTIMIZATION, SDF for SKEW ANNOTATION. Clock grid optimization techniques - reduced clock metal capacitance (by ~45%). Classic clock mesh pruning methods like on-demand grid. Pushing the VIA stack into the MPCTS (Multi-Point CTS) buffer. Providing clock arrival times at each MPCTS entry point on the mesh (SDF file) for the full-chip timing flow. [Figure: standard MPCTS buffer cell, CLK (M2) pin, clock mesh on MH layer] Standard MPCTS buffer cell: the auto router builds the connection from the ‘CLK’ pin to the ‘MH’ clock grid route. [Figure: new MPCTS buffer cell, CLK (M2) and CLK (MH) pins, clock mesh on MH layer] New MPCTS buffer cell: the connection from the M2 pin to the MH layer is built into the cell. The pin is elevated to the MH layer. The new cell is the same size as the standard cell. All of this route cap is saved. Skew from the circuitous route is avoided.
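The clock arrival times supplied at each MPCTS entry point can be summarized as a single skew figure. A minimal sketch, assuming the arrival times have already been read out of the SDF file into a dictionary; the entry-point names and values are invented, and no SDF parsing is shown.

```python
def mesh_skew(arrivals_ps):
    """Return (skew, earliest point, latest point) for mesh entry arrivals."""
    if not arrivals_ps:
        raise ValueError("no arrival times given")
    earliest = min(arrivals_ps, key=arrivals_ps.get)
    latest = max(arrivals_ps, key=arrivals_ps.get)
    return arrivals_ps[latest] - arrivals_ps[earliest], earliest, latest


# HYPOTHETICAL arrival times (ps) at three mesh entry points.
arrivals_ps = {"entry_a": 412.0, "entry_b": 415.5, "entry_c": 409.0}
skew, earliest, latest = mesh_skew(arrivals_ps)
print(f"skew = {skew:.1f} ps ({earliest} -> {latest})")
```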
12 TMAC METHODOLOGY: FLOW CHART. Import the IP/SOC floorplan (DEF or GDS) into Cadence Virtuoso Layout XL. Draw the full clock spine in Cadence Virtuoso XL (schematic, layout). Export the entire clock spine layout to a DEF file (using an internal flow). Merge the clock spine DEF with the other overlay DEF (top layer power grid + clock mesh etc.) in First Encounter. Push down the clock design (distribution + mesh/grid) into floorplan views for RLMs/tiles to see for CTS buffer placement etc. Extract clock routes (StarRCXT) at the IP/SOC top level and run timing using PrimeTime.
13 Custom design data to DEF conversion FLOW CHART. [Flow chart labels: def writer, gdsii, cdl, def, lvs, annotated gdsii file, cross reference files, internal database, data processing tools, component cell list]
14 CLOCK GRID INSERTION and SDF GENERATION: FLOW CHART. Draw clock mesh/grid routes in FE (spec from the clock circuit team: route width, space, shielding). Push down the mesh into the tiles. The CTS buffer placement flow is run. Tiles close placement, routing and timing. A top-level script prunes the MH route completely and inserts back the shortest possible segment to connect the CTS entry buffers to the nearest MV layer. All tile DEFs are exported for the full clock mesh extraction and SPICE simulation flow (CES). Run the CES flow. Skew (clock arrival times, SDF file) is reported to the full-chip timing flow. Here the clock routes are analyzed for EM pass/fail criteria as well. Extract clock distribution routes at the IP/SOC level and run full-chip STA timing (PrimeTime).
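The prune-and-reconnect step can be sketched geometrically. This assumes that MH routes run horizontally and that MV routes are vertical stripes at known x coordinates, which the slide does not state; the coordinates, names, and data layout are hypothetical and do not reflect the actual script.

```python
def shortest_mh_stub(buffer_xy, mv_stripe_xs):
    """Return ((x0, y), (x1, y)): a horizontal MH segment from the buffer
    pin to the nearest vertical MV stripe (ASSUMED geometry)."""
    if not mv_stripe_xs:
        raise ValueError("no MV stripes available")
    x, y = buffer_xy
    nearest_x = min(mv_stripe_xs, key=lambda stripe_x: abs(stripe_x - x))
    x0, x1 = sorted((x, nearest_x))
    return (x0, y), (x1, y)


def reconnect(buffers, mv_stripe_xs):
    """Build one stub per CTS entry buffer after the MH mesh is pruned."""
    return {name: shortest_mh_stub(xy, mv_stripe_xs)
            for name, xy in buffers.items()}


mv_stripes_um = [0.0, 50.0, 100.0]                           # made-up stripe positions
buffers_um = {"buf_0": (12.0, 30.0), "buf_1": (80.0, 45.0)}  # made-up buffer pins
for name, (start, end) in reconnect(buffers_um, mv_stripes_um).items():
    print(name, start, end, "length", end[0] - start[0])
```

Keeping only these stubs, rather than full-length MH routes, is what removes the unused metal capacitance from the mesh.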
15 Benefits proven in recent AMD SOCs. Less resource needs: 32nm SOI APU Graphics IP: 7 clocks, ~30 clock macros, 4 circuit and 4 layout resources; 28nm APU Graphics IP: 9 clocks, 1 clock spine DEF, 1.5 circuit and 1 layout resource. Area savings: 32nm SOI APU Graphics IP area: 98 mm², clock macro area: 1.21 mm² (1.23%); 28nm APU Graphics IP area: 131 mm², clock macro area: 0.18 mm² (0.12%). Floorplan flexibility: with the new methodology (TMAC flow), high-speed clock distribution can be designed to fit into any floorplan. E.g., we were able to deliver the clock distribution design to a server SOC in ¼ the time it takes in the old clock spine macro flow. Reuse across projects: the TMAC library (clock buffer cells etc.) developed for a technology process is being leveraged for multiple APU projects.
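The area-savings figures can be cross-checked with simple arithmetic. The sketch below uses only the rounded areas quoted on the slide and assumes both IP areas are in mm². The 28nm ratio computed this way comes to about 0.14%, slightly above the 0.12% on the slide, presumably because the slide used unrounded areas.

```python
def clock_area_percent(clock_area_mm2, ip_area_mm2):
    """Clock macro area as a percentage of the IP area."""
    return 100.0 * clock_area_mm2 / ip_area_mm2


old_pct = clock_area_percent(1.21, 98.0)   # 32nm SOI APU Graphics IP
new_pct = clock_area_percent(0.18, 131.0)  # 28nm APU Graphics IP
print(f"32nm: {old_pct:.2f}%  28nm: {new_pct:.2f}%")
print(f"clock macro area reduction: {1.21 / 0.18:.1f}x")
```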