3Outline Introduction KLPG Results on Silicon Supply Noise & Power ModelConclusions
4Test Cost Must Fall Fast Test cost/transistor must follow Moore’s LawNeed 100xtransistortest costreduction!ITRS2005, Cost-Perf MPU
5But … Test cell cost not following Moore’s Law Already using DFT tester or old ATEHandler and probe card cost not scalingHigh-speed I/Os cost moreMust reduce test timeParallel testing running out of gasMust reduce test time per transistorConstant test time per chip
6Reducing Test Time Per Transistor Less time, more transistors must wiggle more wires in less timeHigher power dissipationBut… tests/transistor rising to cope with DSMEven higher power dissipation
7But … Max Power/Transistor is Falling ITRS2005, Cost-Perf MPU
8Future Digital Test All About Power? Fraction of chip that can be fired up at one time is decreasingMission-mode power constraintsTest supply noiseTest thermal limitsLimits intra-die test parallelismHow to screen most defects per Joule?
9Eliminate Wasted Energy Useless transitionsScan powerUnnecessary capture powerMuch research/commercial activityLow-odds test patternsLuck – tails of BIST, WRPShotgun blasts – N-detect, DOREME, TARO, …
11Our Delay Test Research Defect-Based Delay test ATPG considering:Resistive shorts and opensProcess variationCapacitive crosstalkTemperature gradientsPower supply noisePower dissipation
12Kitchen Sink Fault Model Have we forgotten anything?CrosstalkDie-to-DieSupplyTempLithoIntra-DieSpot DefectNoiseProcess VariationFunctionalFailureLocalDelay FaultReliabilityHazardCombinedDelay FaultGlobalDelay Fault
13Target Realistic Defects Resistive ShortStanojevic et alResistive OpenMadge et al
14But… Fault population too large Limited fault model accuracy Limited fab dataLimited calibration time, costFault model must be abstract enough for fortuitous detection of unmodeled faults“The vectors do the work, not the fault model” – J. H. Patel
15Our ApproachTest K longest rising/falling paths through each gate/line (KLPG)Targets resistive opensTargets resistive shortsSensitize opposing linesFew bridges per line with largest critical area [Tripp]Larger K deals with delay uncertaintySupply noiseProcess variationDelay modeling errorsCrosstalkAnalogy to N-detect
16Outline Introduction KLPG Results on Silicon Supply Noise & Power ModelConclusions
17K Longest Paths Per Gate (KLPG) CodGen ATPG developed at Texas A&MCodSim fault simulatorTests K longest paths through each gate/lineDetects small defects on each gate/lineCovers all transition faultsNeeds SDFMay produce more patterns than TF test
18Test Generation Algorithm Search spaceScan cellsScan cellsConstraints from outside search space
19KLPG Test Generation Flow StartExtend the partial path with longest potential delayInsert into the partial path storeYApply side inputs and perform direct implicationsApply heuristics to avoid false pathsNConflict?Complete path?NEndFinal justificationY
20Apply KLPG to Industrial Designs TetraMAX/FastScandofile/procfile/libraryHierarchical Verilog DesignSame inputs as TF test generationCPU Time:~3x TF test generationMemory:400 MB/1M gatesKLPG Test GeneratorsdfKTest Data…Only 0’s & 1’s are different from TF ATPG outputsTest Sequence……Load patternPulse clock......Tester
21Chips are Slower Using KLPG Test Transition fault test10 nsKLPG-1 test11 ns180 nm / 40k gates / full scan / 2.3k scan cells
25Up to 3% delay decrease seen in KLPG-1 KLPG Test ResultsUp to 3% delay decrease seen in KLPG-1KLPG unique detects
26KLPG with Bridge Faults KLPG-1 targets resistive opensSAF, N-detect, KLPG-1 tests have good coverage of resistive shortsSar-Dessai and Walker, ITC99Qiu, Walker et al, TECHCON03, VTS04Sensitization much easier than propagationPropagate first, then sensitizeIgnore input-dependent gate strengthIgnore opposing transition
27Bridge Fault ATPG Approach Generate longest path through bridge siteSet DC bits to sensitize opposing value on bridged line (e.g. 0 opposing )No extra uncompacted patterns, since need to test resistive opensElse, set opposing value first, then generate path“Top-off” patterns, but may compact
28Bridge Fault Robust LOC Results Circuit# Lines# ShortsRobust KLPG-1 with ShortsRobust KLPG-1 w/o Shorts# Test PatternsCPU (m:s)s1320713 20726 4141 0062:309092:25s1585015 85031 7005202:454722:35s3593235 93271 8644315:513614:31s3841738 41776 8341 06115:0394914:21s3858438 58477 16858912:0052611:202 random non-feedback bridges to each line = TF count. Shorts between gate inputs and to power/GND are excluded.
29Bridge ATPG Results Modest cost increase Pattern count increases %ATPG time increases <9.2%Expect less impact on large designs, due to lower care bit density
31Dynamic Compaction Test Set Compaction Static Compaction Performed after test generationDynamic CompactionPerformed during test generationClassic method good for stuck-at tests but not suitable for path delay testsDevelop dynamic compaction for KLPG tests31
32Dynamic Compaction Approach Vector pair and NAs (circles) for Path1V2V1I1I2I3I4I5I6O1O2O3O4O5O6Path1XX110XX11X00AO2O3O6O5Vector pair and NAs for Path1 & 2V4I1I2I3I4I5I6O1O4Path11X00XXAxPath2BVector pair and NAs (Xs) for Path2V3XX1XX0I1I2I3I4I5I6O1O2O3O4O5O6xPath2BNow let’s go to the basic idea of our dynamic compaction algorithm. When a path is generated, a set of necessary assignments necessary to sensitize and propagate the transition along the path are identified.This picture shows Path1 with falling transition through line A. Circles are the necessary assignments for Path1. V1 is the vector pair generated through a PODEM-like final justification procedure. V2 can detect Path1 too. But because of the limitation of PODEM-like final justification algorithm, it can not be generated alone.The second picture gives Path2 with rising transition through line B. Suppose V3 is the only vector pair can test Path2.If we generate V1 and V3 at first, later we can not compact them together because first bit of input I2 conflicts. But potentially we can compact path1 and path2 together since V2 is compatible with V3. How can we do it? The solution is to combine two sets of necessary assignments together and call final justification procedure. SoV4 can be generated successfully to test Path1 and Path2 at the same time.32
33Dynamic Compaction Algorithm Definitions:vector : output for ATEpattern : a set of necessary assignments associated with one or more pathsPOOL : a data structure to save patternsCheck the compatibility between necessary assignments of new path against a pattern in the POOLGeneration of final test vector is postponed until test generation is finished33
34Dynamic Compaction Flow Start with new pattern FPOOL empty?YEndInsert F into POOLSet P to the first pattern in POOLNYConflict check between F and PNEnd of POOL?Conflict?Set P to the next pattern in POOLYCombine necessary assignments of F and P & Do Final JustificationNNPass Justification?Update P with FYReorder P in POOL34
35Dynamic Compaction Experiments K Longest robustly testable path generation through each line (K=1)Launch-on-shift/captureCompare to static compactionPOOL size influence on vector countKLPG-1 vs Transition Fault Test35
36Dynamic Compaction Algorithm Definitions:vector : output for ATEpattern : a set of necessary assignments associated with one or more pathsPOOL : a data structure to save patternsCheck the compatibility between necessary assignments of new path against a pattern in the POOLGeneration of final test vector is postponed until test generation is finished36
37Circuits ISCAS 89 benchmark circuits Full scan Unit delay model Chip1 (44K Gates)Partial scanEmbedded memoriesChip2a (22K Gates)SDF delay37
38Robust Test (launch-on-capture) % Vector Reduction RateVectorCount60%60%48%21%55%43%39%53%37%23%26%38
40KLPG-1 Test Set Construction Non-robusttestRobust testLongtransitionfault testA long transition fault test tests longer paths than a regular transition fault test40
41Test Size (KLPG-1 vs. Transition) chip1chip2achip3RobustNon-robustLongTFTotalComm. TF28967302231244286842541146736524913470453528119245210317471900619687493179925374406168855066441445For comparison, this figure shows the dynamically compacted KLPG-1 test size and Transition Fault Test size.Column 2 to column 4 give the vector size of robust test, top-off Non-robust test and top-off Long transition tests. Please note that many non-robust paths are also compacted into previous generated robust test vectors. This number only shows the new vectors generated. The same for Long TF.Column 5 gives the total vector number of KLPG-1 test. The number of transition fault test vectors generated by the commercial tool is listed in the last column. You can see that our KLPG-1 test size is at the same level with transition fault test. In several cases, KLPG-1 test size is even smaller than transition fault test.For chip3, the gate No is around 600K. KLPG-1 test size is around 5 times of commercial tool. But in this case commercial tool generated very small vector size for this design. This indicates that many transition faults in chip3 are easy-to-detect but testing them through the longest paths results in many more necessary assignments and lower compaction rate.Intuitively KLPG-1 test size should be several times bigger than commercial tool, since we add more constrains into the circuit.Considering the higher quality of KLPG-1 test, it is very promising.41
42Dynamic Compaction Results Dynamic Compaction for KLPG testsUp to 3x reduction in vector count~2x CPU time increaseSmall additional memory consumptionKLPG-1 test size comparable to commercial transition fault test42
43Dynamic Compaction Future Work Heuristics to accelerate dynamic compactionAdvanced algorithms for more optimal resultsDynamic compaction for more complicated industrial designsConstraints for power supply noise and temperature43
44Delay Fault Coverage Metric VTS04 metric not constructive for delay test qualityNeed longest path through each line to accurately compute it – must run KLPGSDQM has same problemDie-to-die and intra-die process variationDie-to-die now done as post-process – wastefulSimple bounds on when to stop path generation – coverage vs. pattern count
45Fault Coverage vs. K c7552 Drop fault when UB/LB coverage falloff Most sites need only a few paths
46Ideal K in C5315 with Die-to-Die Most sites need 1 or 2 pathsMost paths in many-path sites are ~same lengthCan drop most w/o much coverage loss
47Capacitive Crosstalk Crosstalk affects near-critical paths Don’t worry about near-critical due to spot defect – probability dominated by defectConsider case (b)
48Capacitive Crosstalk Filter out couplings based on arrival time Use simple greedy algorithmCouplings in order of delay increaseSensitize opposing transition one at a timeMay miss many little coupling caseWorry about timing alignment?ProbabilisticCompaction impact? – More care bits
49Crosstalk AlignmentNeed path from PI to crosstalk site to have correct timingKLPG ATPG algorithm uses min/max delay constraintsTargets are opposing transition in timing windowConstraints narrow as path is builtIf potential alignment/transition is not realized, drop targetUpdate timing with each crosstalk site, since could set other crosstalk sites to help or oppose
50Outline Introduction KLPG Results on Silicon Supply Noise & Power ModelConclusions
51Supply NoiseSupply noise significantly impacts the timing performance of DSM designsFrequencyGate DensityPower DensitySupply VoltageDelay sensitivity to voltageExcessive supply noise may come from:Random fill of don’t care bitsTest pattern compactionNoise longer delay OverkillAs technology advances to DSM regime, designs have become more and more sensitive to supply noise. Let’s see the following trends:Operating Frequency and gate density increases which means there are more simultaneous switching activity per unit area so power density increasesIn the mean time Supply voltage level decreases also gate delay becomes more sensitive to voltage level variationThese trends lead to a more significant power supply noise impact on delayWhy there’s excessive supply noise in delay testing compared with real functional mode? They may come from two sources: first is random filling of don’t care bits Delay test patterns generated by ATPG usually has a low fill-rate, and most bits are don’t care bits. In industry, random fill of those don’t care bits is usually applied to increase fortuitous detection of non-target defects. Unfortunately, industry data shows random fill can produce excessive supply noise. A second source of excessive noise comes from compaction. A highly compacted test pattern may generate excessive noise as well.As I have mentioned in the motivation slide, excessive noise causes unexpected extra delay, and finally results in noise-induced overkill. So that’s the background of the whole problem.
52Concept: Effective Region Circuit extracted as RC networkEffective Region for a device:RC time constant < Clock cycleAssumption: all caps in region are equally effectiveNo action in current cycle, irrelevantDischarge in current cycle, effectiveFirst, I want to introduce several key concepts for the model. The first concept is effective region.We first extract the circuit as a RC network. Assume a current impulse occurs somewhere in the network. Capacitors around this impulse will begin to discharge from nearby to far away, and result in localized voltage drop. If, a capacitor is far enough away, possibly it will not discharge within the current clock cycle. Such capacitors are considered irrelevant to the noise analysis in current cycle. Therefore, we define an effective region for a switching device as the largest area centered by the switching device, and its RC time constant is less than the clock cycle time. It means, in the current clock cycle, the switching current for a device only comes from capacitors in its effective region.To make things easier, we further assume that all capacitors in the region are equally effective regardless of where the capacitor is. For instance, there are two capacitors A and B. A is very close to the center of effective region. B is close to the border. But as long as A & B has same capacitance. they will get discharged same amount of charge to the switching device.AB
53Find Effective Region for a Device Current Algorithm: search region radius from small to maximumPractical improvement: binary searchPerform only once for one designrTo find out the effective region for a device, we can look at the circular regions centered by this device, and search from a small radius to maximum, and stop once the RC time constant of the area exceeds clock cycle time.A simple and practical improvement on the search algorithm is to do binary search which half of the maximum radius.The effective region is quite static. This is because resistance is static, and decoupling capacitance is also static. Parasitic circuit capacitance is usually much smaller than decoupling capacitance, and it varies little from pattern to pattern. Therefore, we only need to perform this search algorithm once.
54Concept: Grid Grid is the smallest unit for analysis RC time small enough compared with clockUniform voltage levelEach Grid Contains:Decoupling capacitanceParasitic capacitanceSwitching devicesEach grid an Effective RegionA second key concept is GridEffective Region is a concept to define locality. If, two switching device are close enough, they will have same effective region and same voltage level. It saves us time to put them together for analysis. So here we define the grid concept, which is the smallest unit for analysisWe divides the whole circuit into n*m grids. Each grid contains: 1) decoupling cap 2) parasitic cap 3) and a bunch of switching device. The RC time constant of a grid is small enough compared with clock cycle time, so we can safely make approximation that all switching device in a grid has the same voltage, and the same effective region. Therefore, we can also say one grid is associated with one effective region.We also view an effective region as a set of grids, instead of an area of devices and capacitors.This grid concept makes our computation much simpler. And it has a slight impact on accuracy.Each Effective Region consists of a set of grids
55Grid Noise ModelCdCpIswitching_1Iswitching_nSwitchingDevicesAssumption: Off-chip current ignored during the launch cycleSwitching charge is equally provided by all grids in its effective regionThe basic idea is: During the beginning period of the clock cycle, when most switching activity occurs, the power pads are unable to provide current immediately to satisfy the switching current demand. This is because off-chip inductance prevents the supply current from rising immediately. Therefore, most charge demanded by the switching devices comes from on-chip capacitance in the effective region. Here, the figure shows the model for one grid. Note that here, grids are not independent of each other. Each grid gets charge from the grids in its effective region, and it also get discharged for some other grids if it belongs to their effective regions.Here, we make an assumption that we simply ignore off-chip current. Our reason is, it has little impact on propagation delay since most transitions have completed before the off-chip current rises appreciably.For effective region, we have made an assumption that any capacitors in the effectively region are equally discharged. Now we shift to unit of grid, we also assume that all grids in the effective region are equally discharged.
56Grid Noise Model Vmax = ( ( i · Qi )) / ( Cd + Cp ) Iswitching_1Iswitching_nSwitchingDevicesVmax = ( ( i · Qi )) / ( Cd + Cp )Grid i: a grid whose effective region covers current gridQi: switching charge of Grid ii: fraction of Qi provided by current gridBased on these model, we are now analyzing every grid to calculate its worst-case voltage drop. As I mentioned in the previous slide, a grid should provide charge to those grids if it belongs to their effective region. So the worst-case voltage drop here is the total charge provided by this grid, divided by the grid capacitance.Here Grid I is a grid whose effective region covers current grid. Qi is the total switching charge demanded by Grid I. Qi is shared by all grids in grid I’s effective region. so alpha I is the fraction of Qi that comes current grid.Using this function, we can find out maximum voltage drop for all the grids on the circuit.
57Switching Current Model Dynamic Charging CurrentIpeaktbegintendtDynamic Charging Current:Look-up table by simulationCharge: Q = 0.5 · Ipeak · (tend – tbegin)Short Circuit Current:empirical function (Saturation Current, wire and device capacitance)Switching charge needs to be calculated for each device, so we need switching current model for this calculationSwitching current drawn from the supply network in CMOS circuits mainly consists of two parts, the dynamic charging current on the output capacitive load, and the short circuit current. In most cases dynamic charging current is more significant than short circuit current.Here’s the dynamic charging current waveform. We model it as triangular. A table is built by simulation for each cell, such that one can determine its peak current and output transition time for different values of output load and input slope. Once we get the peak current and transition time from the table, we can get the total charge by calculating the area of this triangle.Short current is usually insignificant compared with dynamic charging current. We simply use an empirical function here to calculate short circuit charge.
58Delay Model Look-up table at nominal voltage By simulation Delay = f(tin, Cout)Out_slew = g(tin, Cout)Delay/slew is linear to supply voltagelinear factor by simulationIn our work, we first model both nominal delay and transition time as a function of input slope and output capacitive load. A look-up table is built for each library cell using simulation.We then use the linear model to calculate real delay and slew rate as a function of voltage. Again, the linear factor comes from library cell simulation.In practice, we take voltage drop as half of worst-case, and apply delay model to calculate noise-aware delay.
59Supply Noise Analysis Flow StartEndFind EffectiveRegionskipCalculate DelayGet Voltage DropLoad VectorHere’s the comprehensive supply noise analysis flowFor each circuit, we need to associate each grid with a an effective region. This procedure only needs to be done once for each design before the first test pattern applied, then it can be skipped for the following patterns.F each test pattern, we will do logic simulation, assign switching charge to related grids, calculate voltage drop for each grid, and then calculate noise-aware delay.The complexity of this procedure is O(cell_count + grid_count2). In practice, we just make grid_count less than square root of cell_count, since it’s enough for accuracy. so the actual complexity is linear to cell_count only, which is the same as logic simulation.Switching Chargeassigned to gridsLogic SimulationComplexity: O(cell_count + grid_count2)Typically grid_count2 < cell_count
60Experimental Design Experiments on NXP design 130nm DSP-like design (1M+ transistors)LOC path delay patterns with “X” bitsstatically sensitized paths ensures transitions propagate on the target pathFilling strategy: randomly set “X” bits to 1 with a specified rateGenerate filled patterns using various fill ratesWe perform experiments on the same design that I introduced earlier. It is a 130nm DSP-like core, with over 1M transistors.We use LOC path delay patterns with lots of don’t care bits.The paths are statically sensitized so that we ensure transitions propagate on the target path.We then apply filling to these patterns, where we randomly set don’t-care bit to 1 with a specified rate, and fill the rest don’t care bits with 0.We also generate several batches of filled patterns using various filling rate.
61Experimental Measurements Path delay by analysis is correlated with measurementWe then apply our supply noise analysis approach to these test patterns and show correlation with tester measurement. The correlation here is 0.83, which is pretty good. We’ll discuss the offset in the next slide.
62Experimental Measurements Noisy patterns cause significant delay increaseMeasured offset due to delay model characterizationIn this figure, the patterns are ordered by filling rate on x axis. The bottom blue line is nominal delay by delay model, yellow dots are our noise-aware analysis based on delay model, and the top purple dots are test measurements. This figures shows clearly our noise analysis predicts a similar trend as tester measurements. The offset comes from delay model characterization, since there’s a large mismatch between nominal delay and test measurement when noise is small.Ordered by fill rate
63Supply Noise Future Work Supply noise model refinementOff-chip dI/dt currentArray-bond chipsGround bounceBetter activity estimationFocus effort on noisy patternsIncremental estimation for ATPGAvoid logic simulation
64Constant Power Dissipation Constant power linear temperature riseEasy to characterizeKnow temperature for each patternAdjust capture clock timingLonger delay as temperature rises35-55% delay increase for 100C rise in 65 nmReorder patterns for constant power dissipationConsider groups of 10 patternsTakes 1-10 ms for ~1C rise200 bit scan 100 MHz 2 s/pattern10 patterns = 20 s << 1 ms
65Minimize power variation Constant Power FlowDynamicCompactionIssuesNeed fast power modelPatterns not independentPower due to both scan-in and scan-out switchingMentorPreferred FillReduce capture powerAdjacentFillReduce average powerReorderPatternsMinimize power variation
66Power ModelingPrior work by Touba et al indicated WSA proportional to scan chain switchingImproved using scan chain WSAScan cell feeding more gates likely to cause more circuit switchingMost circuit switching during scan happens in first few levels of logicExperiments showed almost no difference in pattern reordering results using model vs. simulation (exact) results
68Constant Power Algorithm Compute shift power of each pattern; /* power model */Group patterns in order using specified group size;Compute total power P[i] of each group i;Compute average power ave of all groups;while iteration count not exceeded, dofor each group i, doif P[i] > (1+pvb)*ave /* pvb = power variance bound = 0.05 here */Find pattern Pn with lowest power in group j with lowest total powerSelect pattern Pm with highest power in group i and swap with Pnelse if P[i] < (1-pvb)*aveFind pattern Pn with highest power in group j with highest total powerSelect pattern Pm with lowest power in group i and swap with Pnelsecontinue to next group;Re-compute shift power for Pn-1, Pn, Pm-1, Pm /* power model */Re-compute total shift power for group i, jUpdate ave;
70Constant Power Results Fast - < 1 minute on ISCAS89Std. Dev./Average drops by 2.5-6x~3% on ISCAS89 circuitsRemaining variation mostly due to high-power patternsSolution: veto high-power patterns during compaction
71Conclusions Demonstrated KLPG on industrial designs Modest test data volume increaseAffordable ATPG time increaseDemonstrated noise model on industrial designDemonstrated constant power reordering
72Future Work Demonstrate on industrial data Fault Coverage Metric Drop faults detected with high probabilityExploit spatial and structural correlationMaximize coupling capacitanceUse supply noise model in compaction and fillingdI/dt model and multi-cycle launch
73Acknowledgements Current Students Zheng Wang Zhongwei Jiang Shiva GanesanFormer StudentsJing Wang (AMD)Lei Wu (TI)Wangqi Qiu (Pextra)Colleagues at TI, NXPSponsors – NSF, SRC