Soc 2.1 Chapter 2 Chip Basics: Time, Area, Power, Reliability, Configurability Computer System Design System-on-Chip by M. Flynn & W. Luk Pub. Wiley 2011.

Slides:



Advertisements
Similar presentations
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
Advertisements

MICROELETTRONICA Sequential circuits Lection 7.
Slide 1Michael Flynn EE382 Winter/99 EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining.
A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O Borgatti, M. Lertora, F. Foret, B. Cali, L.
CHALLENGES IN EMBEDDED MEMORY DESIGN AND TEST History and Trends In Embedded System Memory.
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
Synchronous Digital Design Methodology and Guidelines
Graduate Computer Architecture I Lecture 15: Intro to Reconfigurable Devices.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science August 20, 2009 Enabling.
11/29/2004EE 42 fall 2004 lecture 371 Lecture #37: Memory Last lecture: –Transmission line equations –Reflections and termination –High frequency measurements.
1 HW/SW Partitioning Embedded Systems Design. 2 Hardware/Software Codesign “Exploration of the system design space formed by combinations of hardware.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
Energy Evaluation Methodology for Platform Based System-On- Chip Design Hildingsson, K.; Arslan, T.; Erdogan, A.T.; VLSI, Proceedings. IEEE Computer.
Configurable System-on-Chip: Xilinx EDK
Mehdi Amirijoo1 Power estimation n General power dissipation in CMOS n High-level power estimation metrics n Power estimation of the HW part.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Hao Ji.
Chapter 6 Memory and Programmable Logic Devices
Low Power Design of Integrated Systems Assoc. Prof. Dimitrios Soudris
CPU Performance Assessment As-Bahiya Abu-Samra *Moore’s Law *Clock Speed *Instruction Execution Rate - MIPS - MFLOPS *SPEC Speed Metric *Amdahl’s.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 1 Fundamentals of Quantitative Design and Analysis Computer Architecture A Quantitative.
Computer performance.
6.893: Advanced VLSI Computer Architecture, September 28, 2000, Lecture 4, Slide 1. © Krste Asanovic Krste Asanovic
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
EZ-COURSEWARE State-of-the-Art Teaching Tools From AMS Teaching Tomorrow’s Technology Today.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
Low Power Techniques in Processor Design
Introduction to Computing: Lecture 4
Power Reduction for FPGA using Multiple Vdd/Vth
Lecture 03: Fundamentals of Computer Design - Trends and Performance Kai Bu
CAD for Physical Design of VLSI Circuits
Eng. Mohammed Timraz Electronics & Communication Engineer University of Palestine Faculty of Engineering and Urban planning Software Engineering Department.
Automated Design of Custom Architecture Tulika Mitra
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
Section 10: Advanced Topics 1 M. Balakrishnan Dept. of Comp. Sci. & Engg. I.I.T. Delhi.
1 CS/EE 6810: Computer Architecture Class format:  Most lectures on YouTube *BEFORE* class  Use class time for discussions, clarifications, problem-solving,
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
1 Moore’s Law in Microprocessors Pentium® proc P Year Transistors.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR FPGA Fabric n Elements of an FPGA fabric –Logic element –Placement –Wiring –I/O.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
1 Tuning Garbage Collection in an Embedded Java Environment G. Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin Microsystems Design Lab The.
1 Interconnect/Via. 2 Delay of Devices and Interconnect.
Basics of Energy & Power Dissipation
Morgan Kaufmann Publishers
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.
Patricia Gonzalez Divya Akella VLSI Class Project.
FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.
Overview of VLSI 魏凱城 彰化師範大學資工系. VLSI  Very-Large-Scale Integration Today’s complex VLSI chips  The number of transistors has exceeded 120 million 
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
CS203 – Advanced Computer Architecture
1 Introduction to Engineering Fall 2006 Lecture 17: Digital Tools 1.
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
CS203 – Advanced Computer Architecture
Topics SRAM-based FPGA fabrics: Xilinx. Altera..
Technology advancement in computer architecture
Morgan Kaufmann Publishers
Architecture & Organization 1
Area-Time-Power tradeoffs in computer design: the road ahead
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
Architecture & Organization 1
Overview of VLSI 魏凱城 彰化師範大學資工系.
Computer Evolution and Performance
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Utsunomiya University
Presentation transcript:

soc 2.1 Chapter 2 Chip Basics: Time, Area, Power, Reliability, Configurability Computer System Design System-on-Chip by M. Flynn & W. Luk Pub. Wiley 2011 (copyright 2011)

soc 2.2 Basic design issue: Time clocking pipelining –optimal pipelining –pipeline partitioning –wave pipelining and low overhead clocking

soc 2.3 SIA roadmap

soc 2.4 Tradeoffs in IP selection and design: performance, area, power

soc 2.5 Clock parameters parameters –P max : maximum delay through logic –P min : minimum delay through logic –  t : cycle time (in seconds per cycle) –t w : clock pulse width –t g : data setup time –t d : register output delay –C : total clocking overhead tgtg –tw–tw P max tdtd tt  t = P max + C

soc 2.6 Skew skew: uncertainty in the clock arrival time two types of skew –depends on  t.....skew = k, a fraction of P max where P max is the segment delay that determines  t large segments may have longer delay and skew part of skew varies with L eff, like segment delay –independent of  t....skew =  can relate to clock routing, jitter from environmental conditions, other effects unrelated to segment delay effect of skew = k(P max ) +  –skew range adds directly to the clock overhead

soc 2.7 Optimal pipelining let the total instruction execution without pipelining and associated clock overhead be T in a pipelined processor, let S be the number of segments S - 1 is number of cycles lost due to a pipeline break let b = probability of break, C = clock overhead incl. fixed skew

soc 2.8 Optimum pipelining P1P2P3P4 T suppose T =  i P max i without clock overhead S = number of pipeline segments C = clock overhead T/S max (P max i ) [quantization] P max i = delay of the i th functional unit

soc 2.9  t = T/S + C performance = 1/ (1+(S - 1)b) [IPC] throughput = G = performance /  t [IPS] G = Find S for optimum performance by solving for S: we get Cycle time Avg. Time / segment Clock overhead

soc 2.10 Find S opt estimate b –use instruction traces find T and C from design details –feasibility studies example:

soc 2.11 Quantization + other considerations quantization effects –T cannot be arbitrarily divided into segments –segments defined by functional unit delays –some segments cannot be divided; others can be divided only at particular boundaries some functional operations are atomic –cycle: usually not cross function unit boundary S opt –ignores cost/area of extra pipeline stages –ignores quantization loss –largest S to be used

soc 2.12 Microprocessor design practice tradeoff around design target optimal in-order integer RISC: 5-10 stages –performance: relatively flat across this range –deeper for out-of-order or complex ISA (e.g. Intel Architectures) use longer pipeline (higher frequency) if –FP/multimedia vector performance important –clock overhead low else use shorter pipeline –especially if area/power/effort are critical

soc 2.13 Advanced circuit techniques asynchronous or self-timed clocking –avoids clock distribution problems but has its own overhead multi-phase domino clocking –skew tolerant and low clock overhead; lots of power required and extra area wave pipelining –ultimate limit on  t  t = P max - P min + C

soc 2.14 Basic Design Issues: Silicon Area, Power, Reliability, Reconfiguration die floorplanning methodology area-cost model power analysis and model reliability reconfigurable design soft processors

soc 2.15 AMD Barcelona multicore

soc 2.16 Die floorplanning methodology pick target cost based on market requirements determine total area available within cost budget –defect and yield model compute net available area for processors, caches and memory –account for I/O, buses, test hooks, I/O pads etc. select core processors and assess area and performance re-allocate area to optimize performance –cache, signal processors, multimedia processors, etc.

soc 2.17 Wafers and chips suppose the wafer has diameter d and each die is square with area A d

soc 2.18 If N is the number of dice on the wafer, N =  d) 2 / (4A) [Gross Yield] Let N G be number of good dice and N D be the number of defects on a wafer. Given N dice of which N G are good.....suppose we randomly add 1 new defect to the wafer. What’s the probability that it strikes a good die....and changes N G ? Wafers and chips: example

soc 2.19 Probability of the defect hitting a good die = N G / N The change in N G is d N G /d N D = - N G / N Rewriting this we getd N G / N G = - ( 1/N) d N D Integrating and solving:ln(N G ) = -N D /N + C Since N G = N => N D = 0, C must be ln(N) N G / N = Yield = e - N D /N let defect density ( defects / cm 2 ) =  D N d =  D x wafer area =  D x A x N Yield = N g / N = e -  D A typically  D = 0.3 – 1.0 defect / cm 2

soc 2.20 Using yield to size a die to find the cost per die: 1. find N, the number of die on a wafer 2. find Yield 3. find N g = Yield x N 4. cost/die = wafer cost/ Ng

soc 2.21 Effect of defect density

soc 2.22 What can be put on the die? depends on the lithography and die area lithography determined by f, minimum feature size feature size is related to  the mask registration variation – f = 2

soc 2.23 Smallest device: 5 x

soc 2.24 Area Units: rbe and A rbe: small area unit for sizing functional units of the processor suppose we define another larger unit, A, as 1A =f 2 x 10 6,then 1A = 10 6 / 675 = 1481 rbe since 1481 is close to 1444 we can also refer to the simple register file as occupying 1 A

soc 2.25 Area of other cells 1 register bit = 1 rbe 1 CAM bit = 2 rbe 1 cache bit (6 tx cell) = 0.6 rbe 1 SRAM bit = 0.6 rbe 1 DRAM bit = 0.1 rbe = 67.5 f 2 These are the parameters for basic cells in most design tradeoffs

soc 2.26 Floorplan and area allocation Core processors Signal processor Cache Bus Memory Clock Test

soc 2.27 The baseline: I suppose  d is 0.2 defects /cm 2 and we target 80% yield then A = 110 mm 2 gross or (allowing 20%) guard 88 mm 2 net if f = 0.13  we have 5200 A area units for our design we want to realize –a 32b core processor (w 8kB I & 16kB D cache) –2 32b Vector proc. W 16 x 1k x 32 vector memory + I and D cache –128kB ROM – anything else is SRAM

soc 2.28 The baseline: II This leaves = 2538A available for data SRAM This implies about 512kB of SRAM

soc 2.29 Example SOC floorplan

soc 2.30 Die area summary cost: an exponential function of area successful business model –targets initial production at relatively low yield (~0.3) –ride learning curve and leverage technology to reduce cost and improve performance technical innovation and analysis –intersect with business decisions to make a product –use design feasibility studies and empirical targets –methodology for cost and performance evaluation –marketing targets: determine weighting of performance metrics

soc 2.31 Power consumption power consumption: becoming key design issue increased power: largely due to higher frequency operation

soc 2.32 Bipolar and CMOS clock frequency Bipolar power limit

soc 2.33 Bipolar cooling technology (ca ’91) Hitachi M880: 500 MHz; one processor/module, 40 die sealed in helium then cooled by a water jacket. Power consumed: about 800 watts per module. F. Kobayashi, et al. “Hardware technology for Hitachi M-880.” Proceedings Electronic Components and Tech Conf., 1991.

soc 2.34 Power: real price of performance As feature size & C (capacitance) decrease, the electric fields force a reduction in V. To maintain performance we also reduce V th So as V th decreases this increases I leakage and static power. Static power is now a big problem in high performance designs. Static power can be controlled by maintaining V th and using lower frequencies; also lowering V reduces dynamic power. Dynamic power Static power

soc 2.35 Power and frequency I = C dV/dt ….smaller C enables higher dV/dt (frequency) but I = (V - V th ) 1.25 /V and I also directly determines max. frequency. for V th = 0.6v, halving V also halves the frequency. (E.g. if V goes from 3 to 1.5v then freq is ½) so halving the voltage (V DD or the signal V) halves the frequency BUT reduces the power by 1/8 … (CV 2 f/2) so

soc 2.36 Power: a new frontier cooled high power: >70w/ die high power: w/ die … plug in supply low power: w / die.. rechargeable battery very low power: mw /die.. AA size batteries extremely low power: microwatt/die and below (nano watts).. button batteries no power: extract from local EM field, ….O (1uw/die)

soc 2.37 Battery energy and usage typeenergy capacity timepower recharage able 10,000 mAh 50 hours (10-20% duty) 400mw- 4w 2xAA4000 mAh ½ year (10- 20% duty) 1-10 mw button40mAh5 years (always on) 1uw

soc 2.38 Power is important! by scaling alone a 1000 x slower implementation may need only as much power gating power to functional units and other techniques should enable 100MHz processors to operate at O(10 -3 ) watts goal: O(10 -6 ) watts…. implies about 10 MHz

soc 2.39 design for reliability using –redundancy –error detect and correct –process recoverability –fail-safe computation failure: a deviation from a design specification error: a failure that results in an incorrect signal value fault: an error manifests as an incorrect logical result faults –do not necessarily produce incorrect program execution –can be masked by detection/correction logic, e.g. ecc codes types of faults: –physical fault –design fault Reliability + computational integrity

soc 2.40 Redundancy: carefully applied P(t) = e -t/  –derived in the same way as the yield equation TMR (triple modular redundancy) system –additional reliability over a time much less than  the expected failure time for a single module additional hardware –makes the occurrence of multiple module failures more probable

soc 2.41 Highly reliable designs typical usage –error detection: parity, residue, block codes; sanity & bounds checks –action (instruction) retry –error correction: code or alternate path compute –reconfiguration

soc 2.42 Why reconfigurable design? manage design complexity based on high-performance IP-blocks –avoid the risk and delay of fabrication time – support highly-pipelined designs area – regularity of FPGA, readily to advance to better process technology reliability – FPGA enables redundant cells and interconnections, avoid run-time faults

soc 2.43 Area estimate of FPGAs use rbe model as the basic measure –one slice  7000 transistors = 700 rbe –one logic element (LE)  = 1200 rbe –Xilinx Virtex XC2V6000 = 33,792 slices million rbe or 16400A 8 x 8 multiplier: around 35 slices –equivalent to rbe or 17A –1-bit multiplier in VLSI contains a full-adder and an AND gate  3840 transistors = 384 rbe around 60 times smaller than reconfigurable version block multipliers in FPGAs: more efficient

soc 2.44 Soft processors: using FPGAs soft processors  how soft they are? –an instruction processor design in bit-stream format, used to program an FPGA device –cost reduction, design reuse, … major soft processors include: –Altera: Nios –Xilinx: MicroBlaze –open-source: OpenRISC, Leon –all 32-bit RISC architecture with 5-stage pipelines, connect to different bus standards

soc 2.45 Features: soft processors

soc 2.46 Summary best optimise: time, area, power cycle time: optimized pipelining area: die floorplanning, rbe model power: cooling + battery implications reliability: computational integrity, redundancy reconfiguration: reduce risks and delays –area overhead alleviated by coarse-grained blocks –soft processors: instruction processors in FPGA