Presentation on theme: "Traditional SOC Design Flow"— Presentation transcript:
1. Traditional SOC Design Flow
Key problem: the timing assumptions made during pre-layout synthesis differ widely from the post-layout reality. This happens because interconnect delay dominates the overall propagation delay in DSM (Deep Sub-Micron) technologies. As a result, achieving timing closure becomes a challenge.
Source: Advanced ASIC Chip Synthesis, 2nd Ed., Himanshu Bhatnagar, Kluwer Academic Publishers.
3. Design Compiler Setup Files
The .synopsys_dc.setup file holds library paths; company-wide and project-wide design environment variables and commands; and UNIX variables.
There are three files at three locations. All three are read, in the following order:
Synopsys root ($SYNOPSYS/admin/setup): affects all users. Only the system administrator can modify this file. In a small startup with a single ASIC project, this is the place to enforce project-wide discipline.
Home directory: the content affects all DC activities. Project-wide enforcement could happen at this level if the designer is involved in a single project (less likely).
Working directory: affects the current invocation of DC. If a person is working on more than one Synopsys project (more likely), project-wide enforcement should happen at this level, with one working directory for each project.
Repeated commands are overridden: a setting read later replaces the same setting read earlier.
4. Libraries & Search Path
Technology library: created by the ASIC vendor in Synopsys format, which is now an open standard. Cells are defined by their names, function, timing, net delay, parasitic information, and units for time, resistance, capacitance, etc.
Target library: a technology library that Design Compiler maps to during optimization.
Link library: the technology library that contains the definitions of the cells used in the mapped design. In principle it should be the same as target_library unless a technology translation is being performed.
Symbol library: definitions of graphic symbols. Cells in Symbol Library must match
DesignWare library: a DesignWare component library is a collection of reusable circuit-design building blocks that are tightly integrated into the Synopsys synthesis environment.
GTECH library: the Synopsys generic technology library. It is technology-independent and included with the Design Compiler software. GTECH parts are Synopsys unmapped representations of Boolean functions (library cell placeholders). GTECH instantiation allows for a technology-independent HDL description and the accuracy of instantiation.
search_path: if the library variables specify only file names, search_path is used to locate the libraries. By default it points to the current working directory and $SYNOPSYS/libraries/syn.
5. Synopsys Design Objects
Design: a circuit that performs one or more logical functions.
Cell: an instance of a design or library primitive within a design.
Reference: the name of the original design that a cell instance points to.
Port: the input or output of a design.
Pin: the input or output of a cell.
Net: a wire that connects ports to ports or ports to pins.
Clock: a timing reference object that describes a waveform for timing analysis.
21. Distributed RC wire model
Its delay is commonly estimated with the Elmore delay model.
Source: MIT. Course Lecture L
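As a rough illustration of the Elmore delay for a distributed RC wire, the sketch below splits a wire of total resistance R and total capacitance C into n equal segments and sums, for every node, the upstream resistance times the node capacitance. The function names and the uniform segmentation are assumptions made for illustration; they are not taken from the slides.

```python
def elmore_delay(r_segments, c_segments):
    """Elmore delay at the far end of an RC ladder.

    r_segments[i] is the series resistance of segment i and
    c_segments[i] is the capacitance at the node after it.
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(r_segments, c_segments):
        upstream_r += r            # total resistance between driver and this node
        delay += upstream_r * c    # each node capacitance charges through it
    return delay


def distributed_wire_delay(r_total, c_total, n):
    """Model a uniform wire as n equal RC segments (hypothetical helper)."""
    r = [r_total / n] * n
    c = [c_total / n] * n
    return elmore_delay(r, c)


if __name__ == "__main__":
    for n in (1, 2, 10, 1000):
        print(n, distributed_wire_delay(1.0, 1.0, n))
```

With a single segment the result is the lumped estimate RC; as n grows the value approaches RC/2, which is why a lumped model overestimates the delay of a long wire.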
22. Manual insertion of Repeaters
Source: MIT. Course Lecture L
23. Lumped RC wire model
Source: MIT. Course Lecture L
24. Estimate the rise time
Source: MIT. Course Lecture L
25. The factor 2.2 comes from the 10% to 90% Vdd swing: log_e(0.9 Vdd / 0.1 Vdd) = ln 9 ≈ 2.2.
The width of a transistor is found by multiplying its scaling factor (16/8/2/1) by the minimum transistor width, which is 0.5 µm.
Multiply Cg,N / Cg,P / Cd,N / Cd,P by the width of the transistor to get the gate/drain capacitances of the N and P transistors: a wider transistor has more capacitance.
Divide Reff,N / Reff,P by the width of the transistor to get the resistance of the N and P transistors: a wider transistor has less resistance.
The sheet resistance (0.07) is for a unit square. Since the wire width is 0.25 µm, the resistance of a 1 µm × 0.25 µm piece of wire is 0.07/0.25. This factor is multiplied by the length, 250 µm.
The wire capacitance is made up of two parts: the bottom (area) capacitance, found as 250 × 0.25 (area) × CA,M2, and the side capacitance, found by multiplying the length 250 by CL,M32.
Source: MIT. Course Lecture L
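The arithmetic described on this slide can be sketched as follows. The sheet resistance, wire width and wire length are the slide's numbers; the per-area and per-length capacitance coefficients are not given in the transcript, so they are left as parameters, and all function names are hypothetical.

```python
import math

# 10%-90% swing of a single-pole RC response: ln(0.9 / 0.1) = ln 9
RISE_FACTOR = math.log(0.9 / 0.1)


def wire_resistance(sheet_res_per_square, width, length):
    """Resistance = sheet resistance * number of squares (length / width)."""
    return sheet_res_per_square * length / width


def wire_capacitance(width, length, c_area, c_side):
    """Bottom (area) part plus side part, as described on the slide.

    c_area is capacitance per unit area and c_side is capacitance per
    unit length; both are assumed inputs, not values from the slides.
    """
    return length * width * c_area + length * c_side


def rise_time(resistance, capacitance):
    """10%-90% rise time of a lumped RC stage."""
    return RISE_FACTOR * resistance * capacitance


if __name__ == "__main__":
    r_wire = wire_resistance(0.07, width=0.25, length=250)
    print("factor:", round(RISE_FACTOR, 3))
    print("wire resistance:", r_wire)
```

The wire resistance evaluates to 0.07 / 0.25 × 250, i.e. 70 in the units of the sheet resistance.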
26. Constraints
Technology, operating and manufacturing constraints:
  Max rise time, max capacitance
  Operating conditions: Vdd, temperature
  Drive current, load
  Process variations: fast corner, slow corner
  Physical design: antenna rules
Optimisation constraints:
  Performance (clock)
  Area
  Power
27. Generic Synthesis Flow
[Flow diagram with the boxes: Design; Create a solution; Technology, Operating & Manufacturing Constraints; Optimisation Constraints; Evaluate the solution; Analysis; Constraints Met]
28. Static Timing Analysis (STA)
STA exhaustively verifies that the timing constraints (clock) are met for a design, for a given technology (standard cell library) and a set of specified operating conditions.
Limitations of the alternative, simulation:
  Not exhaustive
  Accuracy: RTL; gate level (SDF back-annotation, dependent on STA); circuit-level SPICE simulations are impractical
  Time (STA also takes time, but is bounded)
Example:
  PROCESS (clk)
  BEGIN
    IF rising_edge(clk) THEN
      s <= a * b;
    END IF;
  END PROCESS;
29. Timing Models - Accuracy
Untimed
Transaction level (SystemC): multiple cycles; bus transactions, transmit/receive, encode/decode.
Cycle accurate (RTL): what happens in each clock cycle is accurately known.
Gate level (event driven): physical details of computation, storage and interconnect operations are known; delay in wires is not known; the clock is ideal.
Layout level: delay in wires is known; the clock is real; the relative position of the standard cells is known.
31. Path Delay Calculation
The intrinsic delays and the slews are characterised using SPICE simulation by sweeping the many parameters that affect the intrinsic delay and slew. All the paths are exhaustively covered.
Inputs: library and design; environment conditions for analysis.
Delay computation: through gate; through wire.
Delay and slew: at gate output; at next gate input.
[Figure labels: A, B, C, D]
32. Paths & Path Groups
Paths: the start point is an input port or the clock pin of a sequential device; the end point is an output port or the data input pin of a sequential device.
Path groups: paths are organised in groups identified by the clocks controlling their endpoints.
33. Timing Arcs
Positive unate timing arc: combines rise delays with rise delays, and fall delays with fall delays. An example is an AND gate cell delay or an interconnect (net) delay.
Negative unate timing arc: combines incoming rise delays with local fall delays, and incoming fall delays with local rise delays. An example is a NAND gate.
Nonunate timing arc: combines the local delay with the worst-case incoming delay value. Nonunate timing arcs are present in logic functions whose output value change cannot be predicted from the direction of the change on the input. An example is an XOR gate.
Accuracy of estimates is critical: intrinsic delays are accurate after logic synthesis; slew and net delays are estimated, and are known accurately only after physical synthesis.
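The three arc types can be read as rules for combining incoming arrival times with the local rise and fall delays. A minimal sketch, assuming arrival times are tracked separately for rising and falling edges; the function and argument names are made up for illustration:

```python
def propagate(arc, in_rise, in_fall, cell_rise, cell_fall):
    """Return (out_rise, out_fall) arrival times at the output of one arc.

    in_rise / in_fall: arrival times of rising / falling edges at the input.
    cell_rise / cell_fall: local delay for an output rise / fall.
    """
    if arc == "positive_unate":      # e.g. AND gate, net: rise -> rise, fall -> fall
        return in_rise + cell_rise, in_fall + cell_fall
    if arc == "negative_unate":      # e.g. NAND gate: input fall -> output rise
        return in_fall + cell_rise, in_rise + cell_fall
    if arc == "non_unate":           # e.g. XOR gate: use the worst incoming edge
        worst = max(in_rise, in_fall)
        return worst + cell_rise, worst + cell_fall
    raise ValueError(f"unknown arc type: {arc}")


if __name__ == "__main__":
    for kind in ("positive_unate", "negative_unate", "non_unate"):
        print(kind, propagate(kind, 1.0, 2.0, 0.5, 0.25))
```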
35. Factors Affecting Delay and Slew
Load on the gate: load of all the inputs that this output has to drive; load of the interconnect wires; tri-stated wires.
Input slew: transition time at the previous gate; the interconnect; for a primary input, the drive strength or driver cell.
36. Constraints
Technology constraints: max transition, max fanout, max capacitance, min capacitance.
Design constraints: set load; set drive (inverse of resistance).
37. If load is not specified, the synthesis tool assumes zero load
Technology constraint: cannot be relaxed.
Design constraint: set_load, or set_drive / set_driving_cell.
If drive or a driving cell is not specified, the synthesis tool assumes infinite drive strength.
[Figure labels: A, Z1, Z2, Z3, 5]
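38. Interpolation and Extrapolation
Piecewise linear model.
[Figure: delay table indexed by slew (S1, S2) and load (L1, L2) with entries D11, D12, D21, D22; the delay D at a point (S, L) is obtained through the intermediate values D1 and D2.]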
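One way to read the figure labels is as a two-step linear interpolation in a delay table: first along the load axis to get D1 and D2, then along the slew axis to get D. The sketch below assumes that D11, D12 belong to slew S1 and D21, D22 to slew S2; this indexing is an assumption, since the figure itself is not in the transcript.

```python
def lerp(x, x0, x1, y0, y1):
    """Linear interpolation (or extrapolation when x lies outside [x0, x1])."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def table_delay(s, l, s1, s2, l1, l2, d11, d12, d21, d22):
    """Delay at slew s and load l from the four surrounding table entries.

    d_ij is assumed to be the delay at slew s_i and load l_j.
    """
    d1 = lerp(l, l1, l2, d11, d12)   # delay at slew s1, load l
    d2 = lerp(l, l1, l2, d21, d22)   # delay at slew s2, load l
    return lerp(s, s1, s2, d1, d2)   # delay at slew s, load l


if __name__ == "__main__":
    # Midpoint of a unit table with entries 1, 2, 3, 4
    print(table_delay(0.5, 0.5, 0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 3.0, 4.0))
```

The same formula extrapolates when the requested slew or load falls outside the characterised range, which is usually less accurate than interpolation.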
39. Process, Voltage, Temperature (PVT) Variation & Operating Conditions
[Figure: delay (best, nominal, worst) versus voltage and temperature.]
Operating conditions:
Name    Library   Process   Temp   Volt   Interconnect Model
WCCOM   my_lib                            worst_case_tree
WCIND   my_lib                            worst_case_tree
WCMIL   my_lib                            worst_case_tree
BCCOM   my_lib                            best_case_tree
BCIND   my_lib                            best_case_tree
BCMIL   my_lib                            best_case_tree
40. PVT Variation: An Example
Consider a minimum-size NMOS device in a 1.2 µm CMOS process, with VGS = VDS = 5 V.
The nominal saturation current for the device size W = 1.8 µm, Leff = 0.9 µm
Now consider the variation in the following parameters:
  25% variation in the threshold voltage Vt.
  10% variation in the transconductance k'n, mainly due to variation in oxide thickness.
  ±0.15 µm (about 10%) variation in W and L. Variations in W and L are uncorrelated as they are
  ±0.5 V (10%) variation in the power supply voltage.
The speed of the device is proportional to the drain current, so these variations translate into variation in the speed of the circuit.
41. Derating
Libraries are characterized for various operating conditions. Further characterisation is done to see how the delay model responds to changes in process, voltage and temperature. This is done by holding two parameters constant and sweeping the third. This yields derating factors for process, voltage and temperature.
42. Sequential Arcs
Timing relationship between two input pins, or between two consecutive events on the same input pin: pulse width, setup, hold, recovery, removal.
43. Pulse Width
Width of the high and low phases of clocks; width of the active level of asynchronous inputs like reset.
[Waveform: rst_n with the pulse width requirement marked. If it is not met, the reset may have no effect.]
44. Setup
Data should be stable for the setup time before the arrival of the clock edge. What happens if the setup time is violated?
[Waveform: clk and data with the setup requirement marked. If it is not met, new data may not get latched.]
45. Hold
Data should be stable for the hold time after the arrival of the clock edge. What happens if the hold time is violated?
[Waveform: clk and data with the hold requirement marked. If it is not met, old data may not get latched.]
46. Recovery and Removal
Recovery: the minimum time between de-assertion of an asynchronous control signal and the next active clock edge. If the requirement is not met, clk may not have effect. It can be formulated as a setup check.
Removal: the minimum time after an active clock edge for which an asynchronous control signal should remain asserted. If the requirement is not met, clk may override rst_n. It can be formulated as a hold check.
[Waveforms: rst_n and clk with the recovery and removal requirements marked.]
47. What is the reason for setup and hold
[Figure: transfer curves on the axes (Vin1, Vout2) and (Vin2, Vout1), with Vin2 = Vout1 and Vin1 = Vout2; points a, b and c marked.]
48. Transistor Level Schematic of a D-Flop
http://www. edn
52. Generalizing Setup & Hold Constraints
[Figure: flop F1 inside the boundary of the flop, with delay D1 on the data path and delay C1 on the clk path.]
Setup constraint: assume C1 is zero. Then clk reaches F1 before the data has arrived at F1, and F1 registers the wrong data. To avoid this, data should stabilize D1 time before the arrival of clk. In reality C1 is never zero, so data should stabilize D1 - C1 time before the arrival of clk. As there are multiple D1 paths and multiple C1 paths, the complete and safe setup constraint is max(data path delays) - min(clock path delays).
Hold constraint: assume D1 is zero. Then data reaches F1 before clk has arrived at F1, and when clk arrives, new data has overwritten the previous data. To avoid this, data should remain stable C1 time after the arrival of clk. In reality D1 is never zero, so data should remain stable C1 - D1 time after the arrival of clk. The complete and safe hold constraint is max(clock path delays) - min(data path delays).
53. Negative Hold
[Figure: flop F1 inside the boundary of the flop, with delay D1 on the data path and delay C1 on the clk path.]
Typically clock paths are well buffered and faster. There can be substantial data path delay, especially in scan flops.
max(data path delays) - min(clock path delays) is always positive. This implies that the setup constraint is never negative.
max(clock path delays) - min(data path delays) can be negative. This implies that the hold constraint can be negative.
Setup + Hold (cannot be negative) = max(clock path) + max(data path) - min(clock path) - min(data path)
[Waveform: negative hold seen at the device interface; clk and data (stable, new) shown at the device interface and at the latching element.]
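The two constraints, and the way a slow data path can push the hold constraint negative, can be checked numerically. This is a small sketch with made-up delay values; the function names are illustrative only.

```python
def setup_constraint(data_path_delays, clock_path_delays):
    """Setup seen at the flop boundary: max(data) - min(clock)."""
    return max(data_path_delays) - min(clock_path_delays)


def hold_constraint(data_path_delays, clock_path_delays):
    """Hold seen at the flop boundary: max(clock) - min(data)."""
    return max(clock_path_delays) - min(data_path_delays)


if __name__ == "__main__":
    data = [3, 5]    # hypothetical internal data path delays (D1 paths)
    clock = [1, 2]   # hypothetical internal clock path delays (C1 paths)
    s = setup_constraint(data, clock)
    h = hold_constraint(data, clock)
    print("setup:", s, "hold:", h, "setup + hold:", s + h)
```

With these values the setup constraint is 4 and the hold constraint is -1, while their sum, 3, equals max(clock) + max(data) - min(clock) - min(data) and stays non-negative.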
54. Specifying Input Delay
Good design practice mandates that inBlock does not have combinatorial logic ("m") driving its output. These days "m" is more likely to be the result of global interconnect delay.
Early floorplanning is a good way to estimate the delay due to "m". If floorplanning is not done, a good bet is 50-60% of the clock cycle.
The characterize command automatically calculates the input delay from the parent design.
set_input_delay -clock Clock 8 "data_in_2"
56. General Timing Constraints
[Figure labels: clk; flops F1, F2, F3; combinatorial blocks C0 to C4; input I2; outputs O1, O2.]
Four kinds of path exist:
  Input to output, e.g. I2 to O2: O2 = TI2 + C4
  Input to register, e.g. I1 to F1: TI1 + C0 ≤ P - S1 and TI1 + C0 ≥ H1. Setup slack: P - S1 - TI1 - C0. Hold slack: TI1 + C0 - H1.
  Register to register, e.g. F1 to F2: DQ1 + C1 ≤ P - S2 and DQ1 + C1 ≥ H2. Setup slack: P - S2 - DQ1 - C1. Hold slack: DQ1 + C1 - H2.
  Register to output, e.g. F3 to O1.
Setup and hold slacks should be positive.
TI1, TI2 are input delays. DQ1, DQ2 and DQ3 are clk-to-Q delays. S1, S2 and S3 are setup constraints. H1, H2 and H3 are hold constraints. C0-C4 are combinatorial delays. P is the clock period.
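The slack formulas have the same shape for every path type: the data arrival time is compared with the period minus setup, and with the hold requirement. A minimal sketch with invented numbers (the helper names and values are not from the slides):

```python
def setup_slack(period, setup, arrival):
    """Setup slack = P - S - arrival; should be positive."""
    return period - setup - arrival


def hold_slack(hold, arrival):
    """Hold slack = arrival - H; should be positive."""
    return arrival - hold


if __name__ == "__main__":
    P = 10.0                      # clock period (hypothetical)
    # Input-to-register path I1 -> F1: arrival = TI1 + C0
    TI1, C0, S1, H1 = 4.0, 3.0, 1.0, 0.5
    print("I1->F1 setup slack:", setup_slack(P, S1, TI1 + C0))
    print("I1->F1 hold slack: ", hold_slack(H1, TI1 + C0))
    # Register-to-register path F1 -> F2: arrival = DQ1 + C1
    DQ1, C1, S2, H2 = 1.0, 6.0, 1.0, 0.5
    print("F1->F2 setup slack:", setup_slack(P, S2, DQ1 + C1))
    print("F1->F2 hold slack: ", hold_slack(H2, DQ1 + C1))
```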
59. Clock Skew
[Figure: clock skew in the Alpha processor.]
The basic assumption in a synchronous system is that all the sequential elements in the design sample their inputs at the same time, marked by a clock signal. In reality, the clock signal does not arrive at all the sequential elements at the same time. The difference in time between the reference clock signal and the local clock signal at a sequential element is called the clock skew.
In fact, clock skew would not be a problem if the clock signal were uniformly delayed at all the sequential elements. It is the non-uniform delay of the clock signal that creates the problem. The delay depends on the distance of the sequential element from the clock source and on the local load.
The primary reason for the delay is the large load seen by the clock signal. The load consists of all the sequential elements in the design and the clock net itself, which behaves as a distributed RC line (or a higher-order model) and can be several centimetres long in a large chip.
The total capacitance of a single clock line easily measures hundreds of pF and can reach into the nF range. The total clock capacitance of the Alpha processor equals 3.25 nF, which is 40% of the total switching capacitance of the entire chip.
63. Clock Skew and Sequential Circuit Performance
Each synchronous module is composed of combinational logic CL and a flop, and is characterised by six timing parameters: the minimum and maximum propagation delays of the register, $t_{r,\min}$ and $t_{r,\max}$, and of the combinational logic, $t_{l,\min}$ and $t_{l,\max}$; the propagation delay of the interconnect, $t_i$; and the local clock skew, $t_\phi$.
The maximum propagation delay corresponds to the time taken by the slowest output to respond to any transition at the input. This delay constrains the maximum allowable clock speed.
The minimum propagation delay corresponds to the time taken by at least one output to start responding to a transition at the input. This delay is typically much smaller than the maximum delay and determines the amount of skew a circuit can tolerate before a race condition occurs. If the skew $\delta = t_\phi'' - t_\phi'$ is greater than $t_{r,\min} + t_i + t_{l,\min}$, then the inputs at R2 can change before the previous inputs are latched.
First equation (race): $t_\phi'' \le t_\phi' + t_{r,\min} + t_i + t_{l,\min}$, or $\delta \le t_{r,\min} + t_i + t_{l,\min}$
Second equation (clock period): $t_\phi'' + T \ge t_\phi' + t_{r,\max} + t_i + t_{l,\max}$, or $T \ge t_{r,\max} + t_i + t_{l,\max} - \delta$
64. Positive and Negative Clock Skew
Positive skew (δ > 0): in this case the clock is routed in the same direction as the data, and the first equation needs to be satisfied. Violating it results in malfunctioning of the circuit. Observe that slowing down the clock (increasing the clock period) does not help. The positive skew actually helps improve the clock speed, as it is a negative term in the constraint on the clock period T.
Negative skew (δ < 0): negative skew occurs when the data is routed in the direction opposite to the clock signal. The first equation is unconditionally satisfied, and the circuit works correctly independent of the skew. Unfortunately, negative skew limits the clock speed and thus lowers the performance, as predicted by the second equation: the skew reduces the time available for computation by |δ|.
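A small numeric sketch of the two skew equations, a race check and a minimum clock period, may make the trade-off between positive and negative skew concrete. The delay values are invented for illustration and the function names are hypothetical.

```python
def race_free(skew, tr_min, ti, tl_min):
    """First equation: skew <= tr_min + ti + tl_min."""
    return skew <= tr_min + ti + tl_min


def min_clock_period(skew, tr_max, ti, tl_max):
    """Second equation: T >= tr_max + ti + tl_max - skew."""
    return tr_max + ti + tl_max - skew


if __name__ == "__main__":
    tr_min, tr_max = 1, 2     # register propagation delays (hypothetical)
    tl_min, tl_max = 2, 7     # combinational logic delays (hypothetical)
    ti = 1                    # interconnect delay (hypothetical)
    for skew in (-2, 0, 2, 5):
        print(
            "skew", skew,
            "race free:", race_free(skew, tr_min, ti, tl_min),
            "min period:", min_clock_period(skew, tr_max, ti, tl_max),
        )
```

With these numbers a skew of +2 shortens the minimum period from 10 to 8, a skew of -2 lengthens it to 12, and a skew of +5 exceeds the race limit of 4 no matter how slow the clock is.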
65. [Timing diagram: launch clock and capture clock with data events a, b, c, d. Setup time met; hold time met.]
66. [Timing diagram: launch clock and capture clock with data events a, b, c, d and shifted events a', b'. Setup time violated; hold time violated.]
67. [Timing diagram: launch clock and capture clock with data events a, b, c, d. Setup time violated; hold time met.]
68. Setup violations result from worst-case timing; hold violations result from best-case timing.
[Figure: FF 1 (startpoint), logic, FF 2 (endpoint), with the setup and hold relationships marked.]
69. Chip Level Timing Issues
Blocks 4 & 8 communicate and need their clocks to be skew aligned. The data signals between blocks 4 & 8 could take more than one clock cycle and can get routed through blocks 5 and 6.
[Figure: two floorplans with a CGU and blocks 1 to 8.]
This makes chip-level timing closure difficult and sensitive to geometry.
A hierarchical design style is preferable, where each chiplet is timing-closed independently and the chip is composed from such chiplets. Solution: latency-insensitive design.
70. Categories of Synchronization
Clock based: GS; GALS; GRLS (KTH technology).
Data based: double latch; handshake (2 phase, 4 phase); asynchronous 2-clock FIFO.
[Figure: data-based and clock-based synchronization compared along the axes ambiguity, latency, constraints and complexity.]
The techniques for synchronization can be divided into two broad categories: those based on the clock and those based on data.
On the clock-based side we have the Globally Synchronous style which, in spite of all the problems and criticism, is the most deeply entrenched, and it is unlikely that we will completely abandon it anytime soon. The Globally Asynchronous, Locally Synchronous (GALS) style has become the most talked-about synchronisation style. In its purest form it involves some fancy techniques for stretching the clock until the data is safely exchanged. As this involves some non-standard design techniques, the purest form of GALS is yet to become mainstream, but there are other GALS styles that do the same thing in spirit while using a standard tool flow, and this is what I would like to emphasize in the next few slides.
On the data side, we have the ever-popular double latching which, once again in spite of its weaknesses, is going to remain at the heart of most clock domain crossing techniques, including the one used in the Islands of Synchronicity (IOS). The 2- and 4-phase handshake is also used in IOS and is in many respects at the heart of the IOS methodology as far as clock domain crossing techniques are concerned.
71. Send and Forget - Double Latching
ACL: Asynchronous Communication Link.
[Figure: source (S) and destination (D) connected by an ACL, with clocks CLKs and CLKD and payloads Ps and PD.]
Double latching may not be the ideal island-hopping technique for several reasons, but it is still very useful. The problem this technique tries to solve is to avoid metastability and increase the MTBF to acceptable levels.
To look at why this technique works and why it fails, let us consider a simple model that describes the dynamics of the voltage ramp-up in a simple synchronizer. Two factors decide the ramp-up. One is a constant metastable value Vms at the midpoint between the high and the low region. The second is the difference between Vms and v(0), the instantaneous voltage sampled by the clock at time 0, which is exponentially amplified. So if this difference is zero or close to it, the ramp up or down will be very slow, but it will eventually settle to one or zero. We therefore want to give these dynamics the maximum time to settle before the value is consumed in the synchronous world. An entire clock period is the maximum one has, and that is what double latching is all about.
So what is the problem? There are several issues. The first is that if the same signal is double latched at different places and, because of skew, the latches sample different values, i.e. v(0) is on different sides of Vms, then the double latches will saturate to different values in different parts of the design, and that can be potentially disastrous.
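The ramp-up model described here is often written as v(t) = Vms + (v(0) - Vms) · exp(t/τ). The sketch below uses that form to estimate how long a sampled value needs to move a given margin away from Vms. The time constant τ and the margin are assumed parameters that the transcript does not supply, and the function names are hypothetical.

```python
import math


def resolved_voltage(v0, v_ms, tau, t):
    """v(t) = Vms + (v(0) - Vms) * exp(t / tau): the offset grows exponentially."""
    return v_ms + (v0 - v_ms) * math.exp(t / tau)


def settle_time(v0, v_ms, tau, margin):
    """Time for |v(t) - Vms| to reach `margin` (how far from Vms counts as settled)."""
    diff = abs(v0 - v_ms)
    if diff == 0.0:
        return math.inf          # exactly metastable: the model never leaves Vms
    if diff >= margin:
        return 0.0               # already far enough from Vms
    return tau * math.log(margin / diff)


if __name__ == "__main__":
    v_ms, tau, margin = 0.5, 1.0, 0.4      # hypothetical, normalised units
    for v0 in (0.6, 0.51, 0.501, 0.5):
        print(v0, settle_time(v0, v_ms, tau, margin))
```

The closer the sampled value is to Vms, the longer the settling takes, which is why the second flop is given a whole clock period before its output is used.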
72. Send and Forget - Double Latching
The second problem is that, to be absolutely safe, the destination frequency has to be slightly lower than the payload frequency to factor in the sample and hold window and the jitter.
The third issue with double latching is that there is no flow control, which is why it is called send and forget.
All in all, it is a good, robust, low-cost method for clock domain crossing. It is widely used for single-bit control signals, which often qualify a data transfer whose payload is typically not double latched. Data can also be double latched if it is Gray coded.
The MTBF falls as the frequency increases. Without double latching it can be down to thousands of seconds at relatively modest frequencies, as the data from TI shows.
Advantages:
  Good choice for single-bit control data.
  Gray-coded multi-bit data payloads are also a target.
Disadvantages:
  No flow control: send and forget.
  A metastable signal sent to multiple targets could resolve to different values.
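Gray coding makes multi-bit values safe to double latch because consecutive values differ in exactly one bit, so a sample taken during a change yields either the old or the new value. A minimal sketch of the standard binary-reflected Gray code conversions:

```python
def binary_to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Inverse conversion: XOR together all right-shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


if __name__ == "__main__":
    for i in range(8):
        print(i, format(binary_to_gray(i), "03b"))
```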
73. Handshake
ACL: Asynchronous Communication Link. Ps: source payload. Pd: destination payload.
[Figure: source and destination FSMs connected by an ACL, with clocks CLKs and CLKD, payloads Ps and PD, and request/acknowledge signals RS, AS, RD, AD.]
Handshaking for data transfer is a very reliable clock domain crossing method and comes in two variants: the two-phase handshake, which requires detection of the polarity of the signal, and the four-phase protocol, which relies on edge detection and returns the control signals to zero.
The logic that generates the handshake signals, request and acknowledge, often involves a Muller C-element, which is easily realised using an SR flop.
The flow control signals could suffer from metastability, but the protocol guarantees a clean transfer of data. To reduce metastability in the flow control signals, they are double latched.
74. Source: 27 MHz, Destination: 200 MHz
The data payload period must be no shorter than the worst-case round-trip delay of the flow control.
  2-phase: 3Ts + 3Td ≥ TPs
  4-phase: 6Ts + 6Td ≥ TPs
While the flow control adds robustness, there is a performance penalty to be paid. With double latching of the flow control signals, the round trip of the 2-phase protocol involves 3 cycles on the source side and 3 cycles on the destination side. The round-trip delay of the 4-phase protocol is even worse, involving 6 cycles on each side.
Translate this to a real-life scenario where isochronous traffic comes in at 27 MHz and is consumed by a bursty destination at 200 MHz: if we decide to use the 2-phase protocol, the maximum payload rate we can sustain is down to about 8 MHz.
Example: Source: 27 MHz, Destination: 200 MHz. Maximum isochronous data rate using the 2-phase protocol: 3 × (37 ns) + 3 × (5 ns) = 126 ns, i.e. about 7.9 MHz.
75. 2-phase and 4-phase
TPs is the period for which the data remains valid/asserted.
  2-phase: 3Ts + 3Td ≥ TPs
  4-phase: 6Ts + 6Td ≥ TPs
Note that TPs does not decide the data payload frequency. TPs is less than the round-trip delay, so that the next payload can be transferred immediately after the round trip is over.
The period TPL corresponding to the data payload frequency has to be at least the worst-case round-trip delay, i.e. 3Ts + 3Td ≤ TPL and 6Ts + 6Td ≤ TPL for the 2- and 4-phase protocols respectively.
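The round-trip arithmetic of the example can be reproduced in a few lines. The sketch assumes, as the slide does, 3 cycles per side for the 2-phase protocol and 6 cycles per side for the 4-phase protocol; the function name is illustrative.

```python
def max_payload_rate_hz(f_src_hz, f_dst_hz, cycles_per_side):
    """Upper bound on the payload rate: 1 / (N*Ts + N*Td)."""
    round_trip_s = cycles_per_side / f_src_hz + cycles_per_side / f_dst_hz
    return 1.0 / round_trip_s


if __name__ == "__main__":
    f_src, f_dst = 27e6, 200e6
    two_phase = max_payload_rate_hz(f_src, f_dst, cycles_per_side=3)
    four_phase = max_payload_rate_hz(f_src, f_dst, cycles_per_side=6)
    print(f"2-phase: {two_phase / 1e6:.2f} MHz")
    print(f"4-phase: {four_phase / 1e6:.2f} MHz")
```

Using the exact 27 MHz period (about 37.04 ns) rather than the rounded 37 ns gives a round trip of roughly 126.1 ns for the 2-phase case, which still corresponds to about 7.9 MHz; the 4-phase bound is half of that.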
76. 2 Clock Asynchronous FIFO
An improvement in terms of the data transfer rate that can be sustained is the 2-clock asynchronous FIFO.
The asynchronous FIFO completely decouples the source and destination islands. The source island can write data to a non-full FIFO and the read island can read data from a non-empty FIFO. Each island can process data on every clock cycle, and no interaction is necessary between the two islands to pass data.
Flow control is achieved by monitoring the full and empty flags. Alternatively, the FIFO levels can be monitored and action can be taken at different trigger points.
The write interface consists of a Gray code counter that generates the write pointer, where the newest element is written. The read pointer is synchronised by double latching before it is processed and, together with the write pointer, helps decide whether the FIFO is full and what the write level is.
The read interface is very similar to the write interface, and the read pointer indicates the oldest element in the FIFO. A flush signal loads the read pointer with the synchronised write pointer value, thereby clearing the FIFO.
The design is tolerant to metastability because either the current or the previous value will be sampled. This means that the FULL flag may be active in spite of there being room to write, but at least nothing gets overwritten. Similarly, the empty flag may be active in spite of the FIFO not being empty, but once again this is better than reading an empty FIFO.
While this FIFO provides robust clock domain crossing, there are some issues with deploying it as an island-hopping technique, because typically the read side must be able to send the read address to the storage FIFO and get read data back in one cycle. The solution is to move the storage to the read island, but then the write data and write clock have to travel an arbitrary distance between islands. We do not want a write clock originating in one island to be used as it is in another island, as we would run into the problems of skew and clock tree balancing that we set out to avoid.
Fail safe, self correcting:
  The write logic could think the FIFO is full when it is not.
  The read logic could think the FIFO is empty when it is not.
Not suitable for island hopping:
  Storage in the write island is a problem.
  Typically the read side needs to be read every cycle.
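The fail-safe behaviour of the flags can be illustrated with a small behavioural model. It uses plain binary pointers with one extra wrap bit and models synchronisation delay simply as each side seeing a stale (older) copy of the other side's pointer; Gray coding and the double latches themselves are not modelled. The depth and all names are assumptions for illustration.

```python
DEPTH = 8                 # hypothetical FIFO depth (power of two)
PTR_MOD = 2 * DEPTH       # pointers carry one extra wrap bit


def occupancy(wr_ptr, rd_ptr):
    """Number of stored entries implied by a pair of pointers."""
    return (wr_ptr - rd_ptr) % PTR_MOD


def full_seen_by_writer(wr_ptr, synced_rd_ptr):
    """Write side: compares its own pointer with a possibly stale read pointer."""
    return occupancy(wr_ptr, synced_rd_ptr) == DEPTH


def empty_seen_by_reader(synced_wr_ptr, rd_ptr):
    """Read side: compares a possibly stale write pointer with its own pointer."""
    return occupancy(synced_wr_ptr, rd_ptr) == 0


if __name__ == "__main__":
    wr, rd = 10, 4                                   # true state: 6 entries stored
    print("true occupancy:", occupancy(wr, rd))
    # Writer still sees an old read pointer (2): it believes the FIFO is full.
    print("writer sees full:", full_seen_by_writer(wr, 2))
    # Reader at 8 still sees an old write pointer (8): it believes the FIFO is empty.
    print("reader sees empty:", empty_seen_by_reader(8, 8))
```

Because a stale read pointer can only make the FIFO look fuller and a stale write pointer can only make it look emptier, the errors are pessimistic: a write or read may be postponed, but data is neither overwritten nor read from an empty FIFO.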
79. Clocking and Communication Schemes
Synchronous design: phase and skew aligned.
Mesochronous design: same clock frequency and phase aligned.
Ratiochronous design: different clock frequencies with a rational relationship, phase aligned (KTH research).
Plesiochronous: no rational clock relationship; the phase relationship drifts.
Asynchronous.
80. Ideal vs Real Clock
During the initial phase of synthesis the clock is ideal.
The set_auto_disable_drc_nets command should be used to prevent DC from wasting time on fixing DRC violations on high-fanout nets like resets and clocks.
Model skew and jitter effects using the set_clock_uncertainty command.
Model clock network latency using the set_clock_latency command.
Once the clock tree has been inserted, use the set_propagated_clock command to use the actual clock. Back-annotation using the read_sdf command is required.