Published by Elmer Curtis. Modified over 8 years ago.
1
soc 2.1 Chapter 2 Chip Basics: Time, Area, Power, Reliability, Configurability
Computer System Design: System-on-Chip, by M. Flynn & W. Luk, Wiley, 2011 (copyright 2011)
2
soc 2.2 Basic design issue: Time
clocking
pipelining
–optimal pipelining
–pipeline partitioning
–wave pipelining and low overhead clocking
3
soc 2.3 SIA roadmap
4
soc 2.4 Tradeoffs in IP selection and design: performance, area, power
5
soc 2.5 Clock parameters
–$P_{max}$: maximum delay through logic
–$P_{min}$: minimum delay through logic
–$t$: cycle time (in seconds per cycle)
–$t_w$: clock pulse width
–$t_g$: data setup time
–$t_d$: register output delay
–$C$: total clocking overhead
cycle time: $t = P_{max} + C$
6
soc 2.6 Skew
skew: uncertainty in the clock arrival time
two types of skew
–depends on $t$: skew $= k P_{max}$, with $k$ a fraction and $P_{max}$ the segment delay that determines $t$; large segments may have longer delay and skew; part of skew varies with $L_{eff}$, like segment delay
–independent of $t$: skew $= \delta$; can relate to clock routing, jitter from environmental conditions, other effects unrelated to segment delay
effect of skew $= k P_{max} + \delta$
–skew range adds directly to the clock overhead
7
soc 2.7 Optimal pipelining
let T be the total instruction execution time without pipelining (and without the associated clock overhead)
in a pipelined processor, let S be the number of segments
S - 1 is the number of cycles lost due to a pipeline break
let b = probability of a break, C = clock overhead including fixed skew
8
soc 2.8 Optimum pipelining
[figure: total time T divided into segments P1, P2, P3, P4]
suppose $T = \sum_i P_{max,i}$, without clock overhead
$S$ = number of pipeline segments
$C$ = clock overhead
$T/S \ge \max_i (P_{max,i})$ [quantization]
$P_{max,i}$ = delay of the $i$th functional unit
9
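soc 2.9
$t = T/S + C$ (cycle time = average time per segment + clock overhead)
performance $= 1 / (1 + (S - 1) b)$ [IPC]
throughput $= G =$ performance $/\, t$ [IPS]
$G = \frac{1}{1 + (S - 1) b} \cdot \frac{1}{T/S + C}$
Find $S$ for optimum performance by solving $dG/dS = 0$ for $S$; we get
$S_{opt} = \sqrt{\frac{(1 - b)\, T}{b\, C}}$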
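The optimum follows from the definitions on this slide. A sketch of the intermediate steps, treating $S$ as a continuous variable (an idealization, since real segment counts are integers) and minimizing $1/G$ instead of maximizing $G$:

```latex
\begin{aligned}
G(S) &= \frac{1}{1+(S-1)b}\cdot\frac{1}{T/S + C}
      = \frac{S}{\bigl(1+(S-1)b\bigr)\,(T + CS)} \\[4pt]
\frac{1}{G(S)} &= \frac{(1-b)\,T}{S} + (1-b)\,C + b\,T + b\,C\,S \\[4pt]
\frac{d}{dS}\,\frac{1}{G(S)} &= -\frac{(1-b)\,T}{S^{2}} + b\,C = 0
\quad\Longrightarrow\quad
S_{opt} = \sqrt{\frac{(1-b)\,T}{b\,C}}
\end{aligned}
```

Only the $1/S$ and $S$ terms depend on $S$, so the optimum balances break penalty ($bCS$) against per-stage work ($(1-b)T/S$).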
10
soc 2.10 Find $S_{opt}$
estimate b
–use instruction traces
find T and C from design details
–feasibility studies
example: [worked example not reproduced in this transcript]
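A minimal sketch of the procedure, assuming the closed form $S_{opt} = \sqrt{(1-b)T/(bC)}$ that follows from maximizing $G = 1/((1+(S-1)b)(T/S+C))$. The values of T, C and b below are hypothetical, chosen only for illustration; they are not the slide's example.

```python
import math

def throughput(S, T, C, b):
    """G = 1 / ((1 + (S - 1) b) (T/S + C)), in instructions per time unit."""
    return 1.0 / ((1.0 + (S - 1) * b) * (T / S + C))

def optimal_segments(T, C, b):
    """Continuous optimum S_opt = sqrt((1 - b) T / (b C))."""
    return math.sqrt((1.0 - b) * T / (b * C))

# Hypothetical inputs (assumptions, not from the slides).
T_NS = 15.0   # total unpipelined logic delay, ns
C_NS = 0.5    # clock overhead per stage, ns
B = 0.2       # probability of a pipeline break

s_opt = optimal_segments(T_NS, C_NS, B)
# Real pipelines need an integer stage count, so also search integers.
best_int = max(range(1, 41), key=lambda s: throughput(s, T_NS, C_NS, B))
print(f"S_opt = {s_opt:.2f}, best integer S = {best_int}")
```

The integer search is a sanity check on the closed form; in practice b would come from instruction traces and T, C from a feasibility study, as the slide says.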
11
soc 2.11 Quantization + other considerations
quantization effects
–T cannot be arbitrarily divided into segments
–segments are defined by functional unit delays
–some segments cannot be divided; others can be divided only at particular boundaries
some functional operations are atomic
–a cycle usually does not cross a functional unit boundary
$S_{opt}$
–ignores cost/area of extra pipeline stages
–ignores quantization loss
–is the largest S to be used
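To illustrate quantization loss, the sketch below groups atomic functional-unit delays into S contiguous segments and compares the slowest segment with the ideal T/S. The delays and the overhead value are hypothetical, and exhaustive search over cut points is just one simple way to find the best grouping.

```python
from itertools import combinations

def best_partition(unit_delays, segments):
    """Split consecutive atomic delays into `segments` contiguous groups,
    minimizing the slowest group (which sets the cycle time before overhead)."""
    n = len(unit_delays)
    best = None
    for cuts in combinations(range(1, n), segments - 1):
        bounds = (0, *cuts, n)
        groups = [unit_delays[a:b] for a, b in zip(bounds, bounds[1:])]
        worst = max(sum(g) for g in groups)
        if best is None or worst < best[0]:
            best = (worst, groups)
    return best

# Hypothetical functional unit delays in ns (assumptions, not from the slides).
DELAYS = [3.0, 5.0, 4.0, 3.0]
C_NS = 0.5            # hypothetical clock overhead, ns
T = sum(DELAYS)

for s in range(1, len(DELAYS) + 1):
    worst, groups = best_partition(DELAYS, s)
    print(f"S={s}: ideal T/S={T / s:.2f} ns, slowest segment={worst:.1f} ns, "
          f"cycle={worst + C_NS:.1f} ns, groups={groups}")
```

The slowest segment never drops below the largest atomic delay, so adding stages past that point only adds overhead.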
12
soc 2.12 Microprocessor design practice
tradeoff around design target
optimal in-order integer RISC: 5-10 stages
–performance: relatively flat across this range
–deeper for out-of-order or complex ISA (e.g. Intel Architectures)
use longer pipeline (higher frequency) if
–FP/multimedia vector performance important
–clock overhead low
else use shorter pipeline
–especially if area/power/effort are critical
13
soc 2.13 Advanced circuit techniques
asynchronous or self-timed clocking
–avoids clock distribution problems but has its own overhead
multi-phase domino clocking
–skew tolerant and low clock overhead; lots of power required and extra area
wave pipelining
–ultimate limit on $t$: $t = P_{max} - P_{min} + C$
14
soc 2.14 Basic Design Issues: Silicon Area, Power, Reliability, Reconfiguration die floorplanning methodology area-cost model power analysis and model reliability reconfigurable design soft processors
15
soc 2.15 AMD Barcelona multicore http://www.techwarelabs.com/reviews/processors/barcelona/
16
soc 2.16 Die floorplanning methodology
pick target cost based on market requirements
determine total area available within cost budget
–defect and yield model
compute net available area for processors, caches and memory
–account for I/O, buses, test hooks, I/O pads etc.
select core processors and assess area and performance
re-allocate area to optimize performance
–cache, signal processors, multimedia processors, etc.
17
soc 2.17 Wafers and chips
suppose the wafer has diameter d and each die is square with area A
[figure: wafer of diameter d]
18
soc 2.18 Wafers and chips: example
If $N$ is the number of dice on the wafer, $N \approx \pi d^2 / (4A)$ [gross yield]
Let $N_G$ be the number of good dice and $N_D$ be the number of defects on a wafer.
Given $N$ dice of which $N_G$ are good, suppose we randomly add 1 new defect to the wafer. What's the probability that it strikes a good die and changes $N_G$?
19
soc 2.19
Probability of the defect hitting a good die $= N_G / N$
The change in $N_G$ is $dN_G / dN_D = -N_G / N$
Rewriting this we get $dN_G / N_G = -(1/N)\, dN_D$
Integrating and solving: $\ln(N_G) = -N_D / N + C$
Since $N_G = N$ when $N_D = 0$, $C$ must be $\ln(N)$
$N_G / N = \text{Yield} = e^{-N_D / N}$
let defect density (defects/cm²) $= D$
$N_D = D \times \text{wafer area} = D \times A \times N$
$\text{Yield} = N_G / N = e^{-DA}$
typically $D$ = 0.3 - 1.0 defects/cm²
20
soc 2.20 Using yield to size a die
to find the cost per die:
1. find $N$, the number of dice on a wafer
2. find Yield
3. find $N_G = \text{Yield} \times N$
4. cost/die = wafer cost / $N_G$
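The four steps can be sketched as below, assuming the gross die count is wafer area over die area, $N = \pi d^2 / (4A)$ (edge losses ignored), and Yield $= e^{-DA}$. The wafer cost, wafer diameter and defect density in the example are hypothetical.

```python
import math

def dice_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Step 1: gross dice per wafer, N = pi d^2 / (4 A), ignoring edge losses."""
    return math.pi * wafer_diameter_cm ** 2 / (4.0 * die_area_cm2)

def die_yield(defect_density_per_cm2, die_area_cm2):
    """Step 2: Yield = e^(-D A)."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defect_density):
    """Steps 3 and 4: N_G = Yield x N, then cost/die = wafer cost / N_G."""
    n = dice_per_wafer(wafer_diameter_cm, die_area_cm2)
    n_good = die_yield(defect_density, die_area_cm2) * n
    return wafer_cost / n_good

# Hypothetical inputs: $5000 wafer, 30 cm diameter, D = 0.5 defects/cm^2.
for area in (0.5, 1.0, 2.0, 4.0):
    cost = cost_per_die(5000.0, 30.0, area, 0.5)
    print(f"A = {area} cm^2: cost/die = ${cost:.2f}")
```

Doubling the die area more than doubles the cost per die, because fewer dice fit and a smaller fraction of them are good.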
21
soc 2.21 Effect of defect density
22
soc 2.22 What can be put on the die?
depends on the lithography and die area
lithography determined by f, minimum feature size
feature size is related to the mask registration variation λ
–f = 2λ
23
soc 2.23 Smallest device: 5λ × 5λ
[figure: device layout with dimensions marked in units of λ]
24
soc 2.24 Area Units: rbe and A
rbe (register bit equivalent): small area unit for sizing functional units of the processor
suppose we define another larger unit, A, as 1 A = $f^2 \times 10^6$; then, with 1 rbe = 675 $f^2$, 1 A = $10^6 / 675$ = 1481 rbe
since 1481 is close to 1444 we can also refer to the simple register file as occupying 1 A
25
soc 2.25 Area of other cells
1 register bit = 1 rbe
1 CAM bit = 2 rbe
1 cache bit (6 tx cell) = 0.6 rbe
1 SRAM bit = 0.6 rbe
1 DRAM bit = 0.1 rbe = 67.5 $f^2$
These are the parameters for basic cells in most design tradeoffs
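A small sketch of how these cell parameters turn a bit count into area. It takes 1 rbe = 675 f² (implied by 0.1 rbe = 67.5 f²) and 1 A = 10⁶ f². The 8 kB example counts only the data array; tags, decoders and sense amplifiers are ignored, which is a simplifying assumption.

```python
F2_PER_RBE = 675.0      # f^2 per rbe, implied by 0.1 rbe = 67.5 f^2
F2_PER_A = 1.0e6        # f^2 per A unit

# rbe per bit for the basic cells listed on the slide.
RBE_PER_BIT = {
    "register": 1.0,
    "CAM": 2.0,
    "cache": 0.6,
    "SRAM": 0.6,
    "DRAM": 0.1,
}

def bits_to_rbe(kind, bits):
    """Area in rbe of `bits` storage cells of the given kind."""
    return RBE_PER_BIT[kind] * bits

def rbe_to_a(rbe):
    """Convert an area in rbe to A units."""
    return rbe * F2_PER_RBE / F2_PER_A

# Example: data array of an 8 kB cache (overheads ignored).
cache_bits = 8 * 1024 * 8
cache_rbe = bits_to_rbe("cache", cache_bits)
print(f"8 kB cache data array: {cache_rbe:.0f} rbe = {rbe_to_a(cache_rbe):.1f} A")
```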
26
soc 2.26 Floorplan and area allocation Core processors Signal processor Cache Bus Memory Clock Test
27
soc 2.27 The baseline: I
suppose D is 0.2 defects/cm² and we target 80% yield
then A = 110 mm² gross or, allowing a 20% guard, 88 mm² net
if f = 0.13 µm we have 5200 A area units for our design
we want to realize
–a 32b core processor (with 8kB I & 16kB D cache)
–two 32b vector processors with 16 × 1k × 32 vector memory + I and D cache
–128kB ROM
–anything else is SRAM
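The baseline numbers can be checked by inverting Yield = e^(-DA). This sketch assumes f is in micrometres and that the 20% guard is taken off the gross area. The slide rounds the gross area to 110 mm² before continuing, so the unrounded results come out slightly higher.

```python
import math

def die_area_for_yield(target_yield, defect_density_per_cm2):
    """Invert Yield = e^(-D A) for the die area A in cm^2."""
    return -math.log(target_yield) / defect_density_per_cm2

D = 0.2               # defects per cm^2 (from the slide)
TARGET_YIELD = 0.8    # from the slide
GUARD = 0.20          # fraction of gross area lost to the guard allowance
F_UM = 0.13           # feature size; micrometres is an assumed unit

gross_mm2 = die_area_for_yield(TARGET_YIELD, D) * 100.0   # cm^2 -> mm^2
net_mm2 = gross_mm2 * (1.0 - GUARD)

# 1 A = f^2 x 10^6 (in um^2); 1 um^2 = 1e-6 mm^2.
a_unit_mm2 = F_UM ** 2 * 1.0e6 * 1.0e-6
area_units = net_mm2 / a_unit_mm2

print(f"gross = {gross_mm2:.1f} mm^2, net = {net_mm2:.1f} mm^2, "
      f"design area = {area_units:.0f} A")
```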
28
soc 2.28 The baseline: II
This leaves 5200 - 2462 = 2538 A available for data SRAM.
This implies about 512kB of SRAM.
29
soc 2.29 Example SOC floorplan
30
soc 2.30 Die area summary
cost: an exponential function of area
successful business model
–targets initial production at relatively low yield (~0.3)
–rides learning curve and leverages technology to reduce cost and improve performance
technical innovation and analysis
–intersect with business decisions to make a product
–use design feasibility studies and empirical targets
–methodology for cost and performance evaluation
–marketing targets: determine weighting of performance metrics
31
soc 2.31 Power consumption
power consumption: becoming key design issue
increased power: largely due to higher frequency operation
32
soc 2.32 Bipolar and CMOS clock frequency Bipolar power limit
33
soc 2.33 Bipolar cooling technology (ca ’91) Hitachi M880: 500 MHz; one processor/module, 40 die sealed in helium then cooled by a water jacket. Power consumed: about 800 watts per module. F. Kobayashi, et al. “Hardware technology for Hitachi M-880.” Proceedings Electronic Components and Tech Conf., 1991.
34
soc 2.34 Power: real price of performance
As feature size & C (capacitance) decrease, the electric fields force a reduction in V. To maintain performance we also reduce $V_{th}$. But as $V_{th}$ decreases, $I_{leakage}$ and static power increase. Static power is now a big problem in high performance designs. Static power can be controlled by maintaining $V_{th}$ and using lower frequencies; lowering V also reduces dynamic power.
[figure: dynamic power and static power]
35
soc 2.35 Power and frequency
$I = C\, dV/dt$: smaller C enables higher dV/dt (frequency)
but I scales as $(V - V_{th})^{1.25} / V$, and I also directly determines max. frequency
for $V_{th}$ = 0.6 V, halving V also roughly halves the frequency (e.g. if V goes from 3 V to 1.5 V then the frequency is about ½)
so halving the voltage ($V_{DD}$ or the signal V) halves the frequency BUT reduces the power to about 1/8, since dynamic power is $C V^2 f / 2$
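A quick numerical check of the halving argument, taking the maximum frequency as proportional to $(V - V_{th})^{1.25} / V$ and dynamic power as $C V^2 f / 2$. Constants of proportionality are dropped, so only ratios are meaningful.

```python
def relative_frequency(v, v_th=0.6, exponent=1.25):
    """Quantity proportional to max frequency: (V - V_th)^1.25 / V."""
    return (v - v_th) ** exponent / v

def relative_dynamic_power(v, f, c=1.0):
    """Dynamic power C V^2 f / 2, in arbitrary units."""
    return c * v ** 2 * f / 2.0

f_hi = relative_frequency(3.0)
f_lo = relative_frequency(1.5)
freq_ratio = f_lo / f_hi
power_ratio = (relative_dynamic_power(1.5, f_lo)
               / relative_dynamic_power(3.0, f_hi))
# Idealized case from the slide: frequency exactly halved.
ideal_ratio = (relative_dynamic_power(1.5, 0.5)
               / relative_dynamic_power(3.0, 1.0))

print(f"frequency ratio = {freq_ratio:.3f}")
print(f"power ratio (model) = {power_ratio:.3f}, (ideal halving) = {ideal_ratio:.3f}")
```

With these numbers the model gives a frequency ratio somewhat above one half, so the 1/8 power figure is the idealized case in which frequency is exactly halved.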
36
soc 2.36 Power: a new frontier
cooled high power: >70 W/die
high power: 10-50 W/die … plug-in supply
low power: 0.1-2 W/die … rechargeable battery
very low power: 1-100 mW/die … AA size batteries
extremely low power: 1-100 µW/die and below (nanowatts) … button batteries
no power: extract from local EM field … O(1 µW/die)
37
soc 2.37 Battery energy and usage
type           energy capacity   time                       power
rechargeable   10,000 mAh        50 hours (10-20% duty)     400 mW - 4 W
2 x AA         4000 mAh          ½ year (10-20% duty)       1-10 mW
button         40 mAh            5 years (always on)        1 µW
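An order-of-magnitude check of the table: average power is stored energy divided by lifetime. The slide gives capacity in mAh only, so the nominal voltages below (3.7 V pack, 1.5 V AA, 3 V button cell) are assumptions added for illustration.

```python
HOURS_PER_YEAR = 8760.0

def average_power_mw(capacity_mah, voltage_v, lifetime_hours):
    """Average power (mW) that drains the battery over the given lifetime."""
    return capacity_mah * voltage_v / lifetime_hours

# (name, capacity in mAh, ASSUMED nominal voltage in V, lifetime in hours)
BATTERIES = [
    ("rechargeable", 10_000.0, 3.7, 50.0),
    ("2 x AA", 4_000.0, 1.5, 0.5 * HOURS_PER_YEAR),
    ("button", 40.0, 3.0, 5.0 * HOURS_PER_YEAR),
]

for name, mah, volts, hours in BATTERIES:
    avg = average_power_mw(mah, volts, hours)
    print(f"{name:12s} average power = {avg:.4g} mW")
```

Under these voltage assumptions the averages land within or near the power column of the table; different chemistries would shift the numbers but not their order of magnitude.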
38
soc 2.38 Power is important!
by scaling alone a 1000× slower implementation may need only $10^{-9}$ as much power
gating power to functional units and other techniques should enable 100 MHz processors to operate at $O(10^{-3})$ watts
goal: $O(10^{-6})$ watts … implies about 10 MHz
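Both figures on this slide are consistent with power scaling as the cube of frequency, which is what results if voltage is lowered in proportion to frequency and dynamic power goes as V² f. That cubic rule is the assumption behind this sketch.

```python
def scaled_power(base_power_w, base_freq_hz, new_freq_hz):
    """Power after scaling voltage with frequency, assuming P ~ f^3."""
    return base_power_w * (new_freq_hz / base_freq_hz) ** 3

# 1000x slower implementation: power fraction relative to the original.
fraction = scaled_power(1.0, 1000.0, 1.0)

# From O(1e-3) W at 100 MHz down to 10 MHz.
goal_w = scaled_power(1.0e-3, 100.0e6, 10.0e6)

print(f"1000x slower -> {fraction:.1e} of the power")
print(f"100 MHz at 1e-3 W -> 10 MHz at {goal_w:.1e} W")
```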
39
soc 2.39 Reliability + computational integrity
design for reliability using
–redundancy
–error detect and correct
–process recoverability
–fail-safe computation
failure: a deviation from a design specification
error: a failure that results in an incorrect signal value
fault: an error that manifests as an incorrect logical result
faults
–do not necessarily produce incorrect program execution
–can be masked by detection/correction logic, e.g. ECC codes
types of faults:
–physical fault
–design fault
40
soc 2.40 Redundancy: carefully applied
$P(t) = e^{-t/\tau}$, where $\tau$ is the expected failure time of a single module
–derived in the same way as the yield equation
TMR (triple modular redundancy) system
–additional reliability over a time much less than the expected failure time for a single module
additional hardware
–makes the occurrence of multiple module failures more probable
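Why TMR only helps over short times can be seen numerically. The sketch assumes an exponential single-module reliability with mean failure time `tau` (a name chosen here), independent module failures, and a perfect majority voter, so the system works when at least two of three modules work: $R_{TMR} = 3R^2 - 2R^3$.

```python
import math

def module_reliability(t, tau):
    """Probability a single module is still fault-free at time t."""
    return math.exp(-t / tau)

def tmr_reliability(t, tau):
    """At least 2 of 3 independent modules working, with a perfect voter."""
    r = module_reliability(t, tau)
    return 3.0 * r ** 2 - 2.0 * r ** 3

TAU = 1.0   # time measured in units of the mean failure time
for t in (0.1, 0.5, math.log(2.0), 1.0, 2.0):
    single = module_reliability(t, TAU)
    tmr = tmr_reliability(t, TAU)
    print(f"t = {t:.3f} tau: single = {single:.4f}, TMR = {tmr:.4f}")
```

Under these assumptions TMR beats a single module only while module reliability exceeds 0.5, that is for t below τ ln 2; beyond that, the extra hardware makes multiple failures likely enough to hurt.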
41
soc 2.41 Highly reliable designs
typical usage
–error detection: parity, residue, block codes; sanity & bounds checks
–action (instruction) retry
–error correction: code or alternate path compute
–reconfiguration
42
soc 2.42 Why reconfigurable design?
manage design complexity based on high-performance IP blocks
–avoid the risk and delay of fabrication
time
–support highly pipelined designs
area
–regularity of FPGA, ready to advance to better process technology
reliability
–FPGA enables redundant cells and interconnections, avoiding run-time faults
43
soc 2.43 Area estimate of FPGAs
use rbe model as the basic measure
–one slice ≈ 7000 transistors = 700 rbe
–one logic element (LE) ≈ 12000 transistors = 1200 rbe
–Xilinx Virtex XC2V6000 = 33,792 slices ≈ 23.65 million rbe or 16400 A
8 × 8 multiplier: around 35 slices
–equivalent to 24500 rbe or 17 A
–1-bit multiplier in VLSI contains a full adder and an AND gate; the 8 × 8 array ≈ 3840 transistors = 384 rbe
–around 60 times smaller than the reconfigurable version
block multipliers in FPGAs: more efficient
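The arithmetic on this slide can be reproduced with 10 transistors per rbe, which is the ratio implied by 7000 transistors = 700 rbe. Reading 3840 transistors as the whole 8 × 8 VLSI multiplier is an interpretation of the slide.

```python
TRANSISTORS_PER_RBE = 10                      # implied by 7000 transistors = 700 rbe
RBE_PER_SLICE = 7000 // TRANSISTORS_PER_RBE   # 700 rbe per slice

def slices_to_rbe(slices):
    """Area in rbe of a design occupying the given number of FPGA slices."""
    return slices * RBE_PER_SLICE

device_rbe = slices_to_rbe(33_792)            # Xilinx Virtex XC2V6000
mult_fpga_rbe = slices_to_rbe(35)             # 8 x 8 multiplier in slices
mult_vlsi_rbe = 3840 / TRANSISTORS_PER_RBE    # 8 x 8 multiplier in custom VLSI
overhead = mult_fpga_rbe / mult_vlsi_rbe

print(f"device = {device_rbe / 1e6:.2f} million rbe")
print(f"multiplier: FPGA {mult_fpga_rbe} rbe vs VLSI {mult_vlsi_rbe:.0f} rbe "
      f"(about {overhead:.0f}x)")
```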
44
soc 2.44 Soft processors: using FPGAs
soft processors: how soft are they?
–an instruction processor design in bit-stream format, used to program an FPGA device
–cost reduction, design reuse, …
major soft processors include:
–Altera: Nios
–Xilinx: MicroBlaze
–open-source: OpenRISC, Leon
–all 32-bit RISC architectures with 5-stage pipelines, connecting to different bus standards
45
soc 2.45 Features: soft processors
46
soc 2.46 Summary
best optimise: time, area, power
cycle time: optimized pipelining
area: die floorplanning, rbe model
power: cooling + battery implications
reliability: computational integrity, redundancy
reconfiguration: reduce risks and delays
–area overhead alleviated by coarse-grained blocks
–soft processors: instruction processors in FPGA