Digital Integrated Circuits A Design Perspective System on a Chip Design.


1 Digital Integrated Circuits A Design Perspective System on a Chip Design

2 Application Specific Integrated Circuits: Introduction Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab.

3 Contents l Why ASIC? l Introduction to System On Chip Design l Hardware and Software Co-design l Low Power ASIC Designs

4 Why ASIC – Design productivity grows! Complexity increases 40% per year. Design productivity increases 15% per year. Integration of PCB on a single die.
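The two growth rates on this slide compound, so the gap between what can be built and what can be designed widens every year. A minimal sketch of that arithmetic, assuming both quantities are normalized to 1.0 in year 0 (the function name `design_gap` is illustrative, not from the slides):

```python
def design_gap(years, complexity_growth=0.40, productivity_growth=0.15):
    """Ratio of design complexity to design productivity after `years`.

    Both quantities are normalized to 1.0 at year 0 (an assumption made
    here for illustration); the growth rates are the slide's 40% and 15%.
    """
    return ((1 + complexity_growth) / (1 + productivity_growth)) ** years


if __name__ == "__main__":
    for y in (1, 5, 10):
        print(f"year {y}: complexity/productivity = {design_gap(y):.2f}x")
```

Under these assumptions the gap grows by roughly 22% per year and reaches about 7x after ten years.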

5 Silicon in 2010 Die Area: 2.5 x 2.5 cm Voltage: 0.6 V Technology: 0.07 µm

6 ASIC Principles Value-added ASIC for huge volume opportunities; standard parts for quick time-to-market applications. Economics of Design –Fast Prototyping, Low Volume –Custom Design, Labor Intensive, High Volume. CAD Tools Needed to Achieve the Design Strategies –System-level design: Concept to VHDL/C –Physical design: VHDL/C to silicon, Timing closure (Monterey, Magma, Synopsys, Cadence, Avant!). Design Strategies: Hierarchy; Regularity; Modularity; Locality

7 ASIC Design Strategies Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters. Performance Specs - function, timing, speed, power. Size of Die - manufacturing cost. Time to Design - engineering cost and schedule. Ease of Test Generation & Testability - engineering cost, manufacturing cost, schedule.

8 ASIC Flow

9 Structured ASIC Designs Hierarchy: Subdivide the design into many levels of sub-modules. Regularity: Subdivide to max number of similar sub-modules at each level. Modularity: Define sub-modules unambiguously, with well-defined interfaces. Locality: Max local connections, keeping critical paths within module boundaries.

10 ASIC Design Options Programmable Logic; Programmable Interconnect; Reprogrammable Gate Arrays; Sea of Gates & Gate Array Design; Standard Cell Design; Full Custom Mask Design; Symbolic Layout; Process Migration - Retargeting Designs

11 ASIC Design Methodologies

12 Why SOC? SOC specs are coming from system engineers rather than RTL descriptions. SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architecture. In SOC design, chips are assembled at IP block level (design reusable) and IP interfaces rather than gate level.

13 CMOS density now allows complete System-on-a-chip Solutions [Figure: single-chip phone block diagram with Analog, A/D, digital down conv, Demod and sync, Viterbi Equal., de-intl & decoder, RPE-LTP speech decoder, speech quality enhancement, voice recognition, phone book, keypad intfc, protocol control, DMA, S/P, DSP core, µP core, RAM & ROM, Dedicated logic; also like to add: FPGA, Reconfigurable Interconnect] Source: Brodersen, ICASSP ‘98 How do we design these chips?

14 Possible Single-Chip Radio Architectures Software Radio GOAL: Simplify System Design Process Seek architectures which are flexible such that hardware and protocols can be designed independently APPROACH: Minimize the use of dedicated logic Universal Radio GOAL: Maximize Bandwidth Efficiency and Battery Life Seek architectures which perform complex algorithms very fast with minimal energy APPROACH: Minimize the use of programmable logic Why is SOC design so scary?

15 60 GHz SiGe Transceiver for Wireless LAN Applications A low power 30 GHz LNA is designed as the front end of the receiver. Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique. The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB. The mixer consists of a single balanced Gilbert cell. A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF. [Figure: 30 GHz receiver layout consisting of the LNA, mixer and VCO]

16 Wideband CMOS LC VCO A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured. This VCO utilizes a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V. The amplitude reference level is programmable by means of a 3-bit DAC. [Figure: VCO die photograph]

17 A High Level View of an Industry Standard Design Flow [Flow diagram: HDL Entry → Synthesis → Floor-plan → Place & Route → Physical Verification (DRC & LVS) → done, with “good?” checks between steps; the steps are split between Front-End and Back-End] Problems with this flow: l Every step can loop to every other step l Each step can take hours or days for a 100,000 line description l HDL description contains no physical information l Different engineers handle the front-end and back-end design How have semiconductor companies made this flow work? Source: Hitachi, Prof. R. W. Brodersen

18 A More Accurate Picture of the Standard Flow l Architecture: Partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit TOOLS: Matlab, C, SPW, (VCC) FREEZE the test vectors l Front-End: Enter HDL code which matches the test vectors TOOLS: HDL Simulators, Design Compiler FREEZE the HDL code l Back-End: Create a floor-plan and tweak the tools until a successful mask layout is created TOOLS: Design Compiler, Floor-planners, Placers, Routers, Clock-tree generators, Physical Verification Architecture 10 months Front-End 10 months Back-End 2 months Fabrication 2 months Source: IBM Semiconductor, Prof. R. Newton How can we improve this flow?

19 Common Fabric for IP Blocks l Soft IP blocks are portable, but not as predictable as hard IP. l Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process. l Common fabric is required for both portability and predictability. l Wide availability: Cell Based Array, metal programmable architecture that provides the performance of a standard cell and is optimized for synthesis.

20 Four main applications l Set-top box: Mobile multimedia system, base station for the home local-area network. l Digital PCTV: concurrent use of TV, 3D graphics, and Internet services l Set-top box LAN service: Wireless home networks, multi-user wireless LAN l Navigation system: steer and control traffic and/or goods transportation

21 PC-Multimedia Applications

22 Types of System-on-a-Chip Designs

23 Physical gap l Timing closure problem: layout-driven logic and RT-level synthesis l Energy efficiency requires locality of computation and storage: match for stream-based data processing of speech, images, and multimedia-system packets. l Next generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.

24 Circular Y-Chart

25 SOC Co-Design Challenges l Current systems are complex and heterogeneous: they contain many different types of components l Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory; the other half: ASIC l Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz. l This will greatly simplify the design for correct timing, testability, and signal integrity.

26 Bridging the architectural gap l One-M gate reconfigurable, one-M gate hardwired logic. l 50 GIPS for programmable components or 500 GIPS for dedicated hardware l Product reliability: design at a level far above the RT level, with reuse factors in excess of 100 l Trade-off: 100 MOPs/watt (microprocessor) vs. 100 GOPs/watt (hardwired) l Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)

27 Why Lower Power l Portable systems –long battery life –light weight –small form factor l IC priority list –power dissipation –cost –performance l Technology direction –Reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed

28 Microprocessor Power Dissipation [Chart: Power (W) vs. year for i286, i386 DX 16, i486 DX25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, and several P-PC and Alpha parts]

29 Levels for Low Power Design

30 Power-hungry Applications l Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management l Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders

31 New Computing Platforms l SOC power efficiency more than 10 GOPs/W –Higher On Chip System Integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers) –Speed & Performance: shorter interconnection, fewer drivers, faster devices, more efficient processing architectures l Mixed signal systems l Reuse of IP blocks l Multiprocessor, configurable computing l Domain-specific, combined memory-logic

32 Low Power Design Flow I Function Partitioning and HW/SW Allocation System Level Specification System-Level Power Analysis Behavioral Description Software Functions Processor Selection Power-driven Behavioral Transformation Behavioral-Level Power Conscious Behavioral Description Power Analysis RT-Level High-Level Synthesis and Optimization Software Optimization Software-Level Power Analysis To RT-Level Design

33 Low Power Design Flow II RT-level Description RTL mapping Logic Synthesis and Optimization Gate-Level Power Analysis Gate-level Description Switch-Level High-Level Synthesis and Optimization RTL Library Data-path Controller Switch-level Description Standard cell Library Processor Control and Steering Logic Memory RTL Macrocells

34 Three Factors affecting Energy – Reducing waste by Hardware Simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing – All-in-one approach (SOC): I/O pin and buffer reduction – Voltage-reducible hardware – 2-D pipelining (systolic arrays) – SIMD: parallel processing, useful for data with parallel structure – VLIW: flexible approach

35 IBM’s PowerPC Lower Power Architecture l Optimum Supply Voltage through Hardware Parallelism, Pipelining, Parallel instruction execution –603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU) –FPU is pipelined so a multiply-add instruction can be issued every clock cycle –Low power 3.3-volt design l Use small complex instruction with smaller instruction length –IBM’s PowerPC 603e is RISC l Superscalar: CPI < 1 –603e issues as many as three instructions per cycle l Low Power Management –603e provides four software controllable power-saving modes. l Copper Processor with SOI l IBM’s Blue Logic ASIC: New design reduces power by a factor of 10

36 Power-Down Techniques Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
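Why lowering the voltage helps where lowering the clock alone does not can be sketched with the first-order CMOS switching-energy model, energy per operation proportional to C·V². The model, the capacitance value, and the function names below are assumptions for illustration, not figures from the slides:

```python
def energy_per_op(c_eff_farads, vdd_volts):
    """First-order dynamic energy per operation: E = C_eff * Vdd^2."""
    return c_eff_farads * vdd_volts ** 2


def task_energy(n_ops, c_eff_farads, vdd_volts):
    """Energy for a fixed amount of work; independent of clock frequency."""
    return n_ops * energy_per_op(c_eff_farads, vdd_volts)


if __name__ == "__main__":
    C_EFF = 1e-9       # hypothetical effective switched capacitance (F)
    N_OPS = 1_000_000  # hypothetical fixed workload
    full = task_energy(N_OPS, C_EFF, 3.3)
    scaled = task_energy(N_OPS, C_EFF, 1.65)
    print(f"3.3 V: {full:.4f} J, 1.65 V: {scaled:.4f} J, ratio {full / scaled:.1f}x")
```

In this model, slowing the clock at a fixed voltage only stretches the same task energy over more time, while halving the supply voltage cuts the energy for the same work by a factor of four.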

37 Implementing Digital Systems

38 H/W and S/W Co-design

39 Three Co-Design Approaches l N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development,” IFIP International Conference FORTE/PSTV’98, Nov. ’98 l ASIP co-design: builds a specific programmable processor for an application, and translates the application into software code. H/w and s/w partitioning includes the instruction set design. l H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma l H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation and communication transformation. Coware (powerful), Siera (reuse), Ptolemy (DSP)

40 Mixing H/W and S/W l Argument: Mixed hardware/software systems represent the best of both worlds. High performance, flexibility, design reuse, etc. l Counterpoint: From a design standpoint, it is the worst of both worlds –Simulation: Problems of verification and test become harder –Interface: Too many tools, too many interactions, too much heterogeneity –Hardware/software partitioning is “AI-complete”! –AI-complete (MIT, Stanford: by analogy with "NP-complete"): A term used to describe problems in artificial intelligence, to indicate that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.

41 Low power partitioning approach l Different HW resources are invoked according to the instruction executed at a specific point in time l During the execution of the add op., the ALU and register are used, but the multiplier is in idle state. l Non-active resources will still consume energy since the corresponding circuits continue to switch l Calculate the wasted energy l Add application-specific cores and run them partially: whenever one core is running, all the other cores are shut down
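The "calculate the wasted energy" step can be sketched as a walk over an instruction trace that charges every resource the current instruction does not use. The resource names, per-cycle idle energies, and instruction-to-resource map below are hypothetical values chosen for illustration only:

```python
# Hypothetical energy (pJ) a resource burns in a cycle where it is not needed.
IDLE_ENERGY_PJ = {"alu": 2.0, "register": 1.0, "multiplier": 8.0}

# Hypothetical map from instruction to the resources it actually uses.
USES = {
    "add": {"alu", "register"},
    "mul": {"multiplier", "register"},
}


def wasted_energy_pj(trace):
    """Sum the energy spent by resources that are idle for each instruction."""
    total = 0.0
    for op in trace:
        for resource, idle_pj in IDLE_ENERGY_PJ.items():
            if resource not in USES[op]:
                total += idle_pj
    return total


if __name__ == "__main__":
    print(wasted_energy_pj(["add", "add", "mul"]), "pJ wasted")
```

A partitioner could compare this figure against the cost of moving a group of instructions onto its own core that is shut down when not in use.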

42 ASIP (Application Specific Instruction Processors) Design l Given a set of applications, determine the micro architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set) l To accurately evaluate the performance of the processor on a given application, we need to compile the application program onto the processor datapath and simulate the object code. l The micro architecture of the processor is a design parameter!

43 ASIP Design Flow

44 Cross-Disciplinary nature l Software for low power: loop transformation leads to much higher temporal and spatial locality of data. l Code size becomes an important objective: software will eventually become a part of the chip l Behavior-platform-compiler codesign: codesigned with C++ or JAVA, describing their h/w and s/w implementation. l Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute)

45 VLSI Signal Processing Design Methodology l pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering l bit-serial, bit-parallel and digit-serial architectures, carry save architecture l redundant and residue systems l Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems

46 Low Power DSP l DO-LOOP Dominant VSELP Vocoder: 83.4 % 2D 8x8 DCT: 98.3 % LPC computation: 98.0 % DO-LOOP Power Minimization ==> DSP Power Minimization VSELP : Vector Sum Excited Linear Prediction LPC : Linear Prediction Coding

47 Deep-Submicron Design Flows l Rapid evaluation of complex designs for area and performance l Timing convergence via estimated routing parasitics l In-place timing repair without resynthesis l Shorter design intervals, minimum iterations l Block-level design and place and route l Localized changes without disturbance l Integration of complex projects and design reuse

48 SOC CAD Companies l Avant! l Cadence l Duet Tech l Escalade l LogicVision l Mentor Graphics l Palmchip l Sonics l Summit Design l Synopsys l Topdown design solutions l Xynetix Design Systems l Zuken-Redac

49 Design Technology for Low Power Radio Systems Rhett Davis Dept. of EECS Univ. of Calif. Berkeley

50 Domain of Interest l Highly integrated system-on-a-chip solutions – SOC’s l Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc… l Primary computation is high complexity dataflow with a relatively small amount of control

51 Why Systems-on-a-Chip - SOC ? State-of-the-Art CMOS is easily able to implement complete systems (or what was on a board before) – A microprocessor core is only 1-2 mm² (1-2 % of the area of a $4 chip) – Portability (size) is critical to meet the cost, power and size requirements of future wireless systems – Chips will be required to support the complete application (wireless internet, multimedia) – Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver

52 Cellular Phones: An example [Block diagram: Small Signal RF, Power RF, Analog Baseband, Digital Baseband (DSP + MCU), Power Management] Digital Cellular Market (Phones Shipped), Units: 48M, 86M, 162M, 260M, 435M (Courtesy Mike McMahon, Texas Instruments)

53 Cellular Phone Baseband SOC MCU Gates Analog ROM DSP RAM phones on each 8” Leff 1Million Baseband Chips per Day!!! (Courtesy Mike McMahon, Texas Instruments)

54 Wireless System Design Issues l It is now possible to use CMOS to integrate all digital radio functions – but what is the “best” architectural way to use CMOS??? l Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing – What’s keeping us from achieving this in silicon? – What can we do about it?

55 Computational Efficiency Metrics l Definition: MOPS – Millions of algorithmically defined arithmetic operations (e.g. multiply, add, shift) – in a GP processor several instructions per “useful” operation l Figures of merit – MOPS/mW - Energy efficiency (battery life) – MOPS/mm² - Area efficiency (cost) Optimization of these “efficiencies” is the basic goal assuming functionality is met

56 Energy-Efficiency of Architectures [Chart: Energy Efficiency in MOPS/mW (or MIPS/mW) vs. Flexibility (Coverage). Embedded µProcessors / Microprocessor: 0.1-1 MIPS/mW; ASIPs, DSPs: 1-10 MIPS/mW; Reconfigurable Processor/Logic (Reconfiguration ???); Dedicated HW (direct mapped)]

57 Software Processors: Energy Trends Primary means of performance increase of software processors has been by increasing clock rate. Decreasing Energy Efficiency. $E \propto C \cdot V_{DD}^2$ [Chart: data points plotted against Freq (MHz) for i386, i386C-33, i486C-33, PP-100, PP-133, PP166, PPro-150, PPro200, PPC603e-100, A21064A, MIPS R4400, MIPS R5000, MIPS R10000, SuperSparc2-90, UltraSparc-167, HP PA7200, HP PA8000, among others]

58 Software Processors: Area Trends DSP processor with 1 multiplier (25 mm²) vs. 16x16 multiplier (0.05 mm²). Why time multiplex to save area if the overhead is much greater than the area saved? l Increasing clock rate results in a memory bottleneck – addressed by bringing memory on-chip l Area is increasingly dominated by memory – degrading MOPS/mm²

59 Parallelism is the answer, but … l Not by putting Von Neumann processors in parallel and programming with a sequential language – Attempts to do this have failed over and over again… – The parallel computer compiler problem is very difficult l Not by trying to capture parallelism at the instruction level – Superscalar, VLIW, etc… are very inefficient – Hardware can’t figure out the parallelism from a sequential language either The problem is the initial sequential description (e.g. C) which is poorly matched to highly parallel applications

60 What is really happening… While (i=0;i++:i

61 What can a fully parallel CMOS solution potentially do? In 0.25 micron a multiplier requires 0.05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy. Let's implement a 50 mm², 0.25 micron chip using adders, registers and multipliers. We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of power goes into clocks. A 25 MHz clock (1 volt) gives ~50 Gops at 100 mW: 500 MOPS/mW and 1000 MOPS/mm².
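The slide's estimate can be reproduced step by step. The sketch below uses only the slide's numbers, plus two assumptions made explicit here: every unit performs one operation per clock cycle, and the clock's one-third share is a share of total chip power. The names are illustrative:

```python
MULT_AREA_MM2 = 0.05                   # multiplier area in 0.25 micron
MULT_ENERGY_PJ = 7.0                   # multiplier energy per operation at 1 V
ADD_AREA_MM2 = MULT_AREA_MM2 / 10      # "about 10 times smaller"
ADD_ENERGY_PJ = MULT_ENERGY_PJ / 10    # "10 times lower energy"
N_MULT, N_ADD = 200, 2000
CLOCK_HZ = 25e6
CHIP_AREA_MM2 = 50.0
CLOCK_POWER_FRACTION = 1 / 3           # assumed to be a share of total power


def parallel_chip_estimate():
    """Throughput, area, and power of the fully parallel chip."""
    # Assumption: each adder/register and multiplier does one op per cycle.
    gops = (N_MULT + N_ADD) * CLOCK_HZ / 1e9
    datapath_area = N_MULT * MULT_AREA_MM2 + N_ADD * ADD_AREA_MM2
    energy_per_cycle_pj = N_MULT * MULT_ENERGY_PJ + N_ADD * ADD_ENERGY_PJ
    datapath_mw = energy_per_cycle_pj * 1e-12 * CLOCK_HZ * 1e3
    total_mw = datapath_mw / (1 - CLOCK_POWER_FRACTION)
    mops = gops * 1e3
    return {
        "gops": gops,
        "datapath_area_mm2": datapath_area,
        "total_power_mw": total_mw,
        "mops_per_mw": mops / total_mw,
        "mops_per_mm2": mops / CHIP_AREA_MM2,
    }


if __name__ == "__main__":
    for name, value in parallel_chip_estimate().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the datapath occupies 20 mm² (under half of the 50 mm² die) and delivers about 55 Gops at about 105 mW, in line with the slide's rounded ~50 Gops, 100 mW, 500 MOPS/mW, and 1000 MOPS/mm².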

62 Start with a parallel description of the algorithm…

63 Then directly map into hardware … [Figure: datapath with Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add, Sub, Shift units]

64 Results in fully parallel solutions [Table comparing Direct-Mapped Hardware, FPGA, Low-Power DSP, and High-Performance DSP. Energy columns: 64-point FFT, Energy per Transform (nJ); 16-State Viterbi Decoder, Energy per Decoded bit (nJ). Area columns: 64-point FFT, Transforms per second per unit area (Trans/ms/mm²); 16-State Viterbi Decoder, Decode rate per unit area (kb/s/mm²). Numbers taken from vendor-published benchmarks] Orders of magnitude lower efficiency even for an optimized processor architecture

65 Reasons software solutions seem attractive (1) Believed to reduce time-to-system-implementation (2) Provides flexibility (3) Locks the customers into an architecture they can’t change (4) Difficulty in getting dedicated SOC chips designed Are these good reasons???

66 (1) Believed to reduce time-to-system implementation l Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done) l Limitations of the software prototype often set the ultimate limit of the system performance l Software solutions can be shipped with bugs, not a real option for SOC

67 (2) Need flexibility l Software is not always flexible – Can be hard to verify l Flexibility does not imply software programmability – Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications – Reconfiguration of interconnect can achieve flexibility with high levels of efficiency

68 Flexibility without software [Charts: Energy per Transform vs. FFT size; Transforms per Second per mm² vs. FFT size] * All results are scaled to 0.18 µm

69 Reasons software solutions seem attractive (1) Believed to reduce time-to-system implementation (2) Provides flexibility (3) Locks the customers into an architecture they can’t change (4) Difficulty in getting dedicated SOC chips designed

70 Standard DSP-ASIC Design Flow Problems: l Three translations of design data l Requirements for re-verification at each stage l Uncontrolled looping when pipeline stalls Prohibitively Long Design Time for Direct Mapped Architectures [Flow: Algorithm Design (Floating-Point Simulation; Sequential) → System/Architecture Design (Fixed-Point Simulation; Mixed Sequential & Structural) → Hardware/Front-End Design (RTL Code; Integer only, Structural w/ Sequential Leaf-cells) → Physical/Back-End Design (Mask Layout; Single-wire Connectivity w/ Timing Constraints)]

71 Direct Mapping Design Flow l Encourages iterations of layout l Controls looping l Reduces the flow to a single phase l Depends on fast automation Algorithm/System Simulation Front-End RTL Libraries Back-End Floorplan Automated Flow Mask Layout Performance Estimates

72 Déjà vu??? l An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago l What happened? – A decline of research into design methodologies – A single dominant flow has resulted - the Verilog-Synopsys- Standard Cell – Lack of tool flows to support alternative styles of design – Research community lost access to technology – moved to highly sub-optimal processor and FPGA solutions

73 Capturing Design Decisions Categories: l Function - basic input-output behavior l Signal - physical signals and types l Circuit - transistors l Floorplan - physical positions How to get layout and performance estimates in a day? [Figure: datapath with MAC, reg. file, add, and shift blocks]

74 Simplified View of the Flow New Software: l Generation of netlists from a dataflow graph l Merging of floorplan from last iteration l Automatic routing and performance analysis l Automation of flow as a dependency graph (UNIX MAKE program) merge autoLayout elaborate netlist route layout dataflow graph floorplan macro library
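Automating a flow as a dependency graph, in the way MAKE does, amounts to running each step only after the steps it depends on. A minimal sketch using Python's standard `graphlib` (Python 3.9+); the step names come from the slide, but the edges between them are assumed here for illustration and are not the authors' actual rules:

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set. These edges are hypothetical.
FLOW_DEPS = {
    "elaborate": {"dataflow_graph", "macro_library"},
    "netlist": {"elaborate"},
    "merge": {"netlist", "floorplan"},
    "route": {"merge"},
    "layout": {"route"},
}


def build_order(deps):
    """Return one valid execution order, dependencies before dependents."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(" -> ".join(build_order(FLOW_DEPS)))
```

A real MAKE-based flow would additionally skip steps whose outputs are newer than their inputs, which is what keeps each design iteration short.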

75 Why Simulink? l Simulink is an easy sell to algorithm developers l Closely integrated with popular system design tool Matlab l Successfully models digital and analog circuits Time-Multiplexed FIR Filter

76 Modeling Datapath Logic l Discrete-Time (cycle accurate) l Fixed-Point Types (bit true) l Completely specify function and signal decisions l No need for RTL Multiply / Accumulate
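What "cycle accurate" and "bit true" mean for a multiply/accumulate block can be sketched with plain integers: one call models one clock cycle, and the result wraps exactly as a fixed-width two's-complement register would. This is an illustrative model with assumed word lengths (Q1.15 inputs, 32-bit accumulator), not Simulink's API:

```python
def to_fixed(x, frac_bits):
    """Quantize a real number to an integer with `frac_bits` fractional bits."""
    return int(round(x * (1 << frac_bits)))


def wrap_signed(value, bits):
    """Wrap an integer into a `bits`-wide two's-complement range."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b, acc_bits=32):
    """One clock cycle of multiply/accumulate with a wrapping accumulator."""
    return wrap_signed(acc + a * b, acc_bits)


if __name__ == "__main__":
    a = to_fixed(0.5, 15)   # Q1.15 operand
    b = to_fixed(0.25, 15)  # Q1.15 operand
    acc = mac(0, a, b)      # product carries 30 fractional bits
    print(acc, acc / (1 << 30))
```

Because every bit of the result is defined, the same vectors can be used to cross-check a hardware implementation without a separate RTL reference model.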

77 Modeling Control Logic l Extended finite state-machine editor l Co-simulation with dataflow graph l New Software: Stateflow-VHDL translator l No need for RTL [Figure: Address Generator / MAC Reset]

78 Specifying Circuit Decisions l Macro choices embedded in dataflow graph l Cross-check simulations required Time-Multiplexed FIR Filter Stateflow-VHDL translator RTL Code or Data-path Generator Code or Custom Module Black Box

79 Hierarchy Hardened Progressively l Macro characterization saved for fast estimates l Each level of hierarchy becomes a new hard macro l Higher levels of hierarchy are adjusted l When top level of hierarchy is hardened, the design is done System-Level Design Environment estimate performance: power, area, delay Hard Macro Characterization Libraries layout and characterize new hard macro

80 Capturing Floorplan Decisions l Commercial physical design tools used l Instance names in floorplan match dataflow graph l Placements merged on each iteration l Manhattan distance can be used for parasitic estimates Parallel Pipelined FIR Filter
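Using Manhattan distance for parasitic estimates can be sketched as follows: take the placements of two connected instances from the floorplan, measure the rectilinear distance between them, and scale by a capacitance per unit length. The per-micron capacitance and the coordinates below are hypothetical placeholders, not process data:

```python
CAP_PER_UM_FF = 0.2  # hypothetical wire capacitance in fF per micron


def manhattan_um(p, q):
    """Rectilinear distance between two (x, y) placements in microns."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wire_cap_ff(p, q, cap_per_um_ff=CAP_PER_UM_FF):
    """Lumped-capacitance estimate for a wire routed between p and q."""
    return manhattan_um(p, q) * cap_per_um_ff


if __name__ == "__main__":
    mac_pin = (0.0, 0.0)      # hypothetical placement of a MAC output
    reg_pin = (300.0, 400.0)  # hypothetical placement of a register input
    print(f"{manhattan_um(mac_pin, reg_pin):.0f} um, "
          f"{wire_cap_ff(mac_pin, reg_pin):.1f} fF")
```

The estimate is available as soon as placements exist, before any routing has been run.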

81 Reduced Impact of Interconnect Long wires can be modeled as lumped capacitances [Chart: wire delay vs. FO4 inverter delay in 0.18 µm]

82 Race-Immune Clock Tree Synthesis Hierarchical Clock Tree Synthesis. Race-immunity condition: $t_{skew(max)} < t_{clk-Q(min)} - t_{hold(max)}$. Race margin = 580 ps (0.18 µm, V_DD = 1 V). Demonstrated on a 600k transistor design. Example Clock Tree: Stages: 22, Sinks: 7650, Skew: 320 ps, Clock Power: 2.8 mW, Logic Power: 21 mW
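The race-immunity condition on this slide, t_skew(max) < t_clk-Q(min) - t_hold(max), can be checked numerically. The sketch below assumes the 580 ps race margin is the right-hand side of that inequality; the individual clk-Q and hold values are hypothetical numbers picked to produce that margin:

```python
def race_margin_ps(t_clk_q_min_ps, t_hold_max_ps):
    """Largest clock skew a register-to-register path tolerates without racing."""
    return t_clk_q_min_ps - t_hold_max_ps


def race_immune(t_skew_max_ps, t_clk_q_min_ps, t_hold_max_ps):
    """True when the worst-case skew stays below the race margin."""
    return t_skew_max_ps < race_margin_ps(t_clk_q_min_ps, t_hold_max_ps)


if __name__ == "__main__":
    T_CLK_Q_MIN_PS = 700.0  # hypothetical minimum clock-to-Q delay
    T_HOLD_MAX_PS = 120.0   # hypothetical maximum hold time
    SKEW_PS = 320.0         # skew of the example clock tree
    print("margin:", race_margin_ps(T_CLK_Q_MIN_PS, T_HOLD_MAX_PS), "ps")
    print("race immune:", race_immune(SKEW_PS, T_CLK_Q_MIN_PS, T_HOLD_MAX_PS))
```

With the example tree's 320 ps skew against a 580 ps margin, the condition holds with 260 ps to spare.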

83 Example 1: Macro Hardening. Parallel pipelined FIR filter: 1.4 mm² area in 0.25 µm; 240 k transistors; 21 k cells; 18.0 ns critical path delay (1 V, PathMill); 13.0 mW at 25 MHz (1 V, PowerMill); execution time: 3 hours (elaborate / route), 9 hours (characterization); disk space: 180 MB (elaborate / route), 1.5 GB (characterization). Most time/disk space spent on extraction and power simulation

84 Example 2: Test Chip l 300k transistors l 0.25 µm l 1.0 V l 25 MHz l 6.8 mm² l 14 mW l 2 phase clock l 3 layers of P&R hierarchy Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz

85 TDMA Baseband Receiver l 600k transistors l 0.18 µm l 1.0 V l 25 MHz l 1.1 mm² l 21 mW l single phase clock l 5 clock domains l 2 layers of P&R hierarchy [Figure blocks: carrier detection, frequency estimation, rotate & correlate, control]

86 Conclusions l Direct-Mapped hardware is the most efficient use of silicon l Direct-Mapped hardware can be easier to design and verify than embedded hardware/software systems l Don’t translate design data, refine it l Design with dataflow graphs, not sequential code l Design flow automation speeds up design space exploration

87 Embedded Processor Architectures and (Re)Configurable Computing Vandana Prabhu Professor Jan M. Rabaey Jan 10, 2000

88 Pico Radio Architecture Reconfigurable DataPath FPGA Embedded uP Dedicated FSM Dedicated DSP

89 Reconfigurable Computing: Merging Efficiency and Versatility “Hardware” customized to specifics of problem. Direct map of problem specific dataflow, control. Circuits “adapted” as problem requirements change. Spatially programmed connection of processing elements.

90 Matching Computation and Architecture Convolution Two models of computation: communicating processes + data-flow AddressGen Memory MAC Control Processor LCG Two architectural models: sequential control+ data-driven

91 Implementation Fabrics for Data Processing Data In: 300 million multiplications/sec, 357 million add-sub’s/sec. Adaptive Pilot Correlator / Digital Baseband Receiver, in that column order: DSP: Power 460 mW, Area 1089 mm² / Power 1500 mW, Area 3600 mm². Direct Mapped: Power 3 mW, Area 1.3 mm² / Power 10 mW, Area 5 mm². Pleiades: Power 18.49 mW, Area 5.44 mm² / Power 62.33 mW, Area 21.34 mm². 16 Mmacs/mW!

92 Software Methodology Flow Algorithms Kernel Detection Estimation/Exploration Partitioning Software Compilation Reconfig. Hardware Mapping Interface Code Generation Power & Timing Estimation of Various Kernel Implementations Area & PDA Models Premapped Kernels Accelerator µproc & Timing Constraints Xform’s for low power Behavioral Kernels Executable Intermediate Form Interconnect Optimization Reconfig HW (Marlene Wan)

93 Maia: Reconfigurable Baseband Processor for Wireless 0.25um tech: 4.5mm x 6mm 1.2 Million transistors 40 MHz at 1V 1 mW VCELP voice coder Hardware 1 ARM-8 8 SRAMs & 8 AGPs 2 MACs 2 ALUs 2 In-Ports and 2 Out-Ports 14x8 FPGA

94 Implementation Fabrics for Protocols A protocol = Extended FSM [Figure: Intercom TDMA MAC; labels include BUF, Memory, Slot_Set_Tbl (2x16), addr, slot_set, Slot_no, Slot start, Pkt end, RACH req, RACH akn, W_ENA, R_ENA, update, idle, write, read, slotset, RACH] l ASIC: 1 V, 0.25 µm CMOS process l FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA l ARM8: 1 V 25 MHz processor; n = 13,000 l Ratio: >> 400 Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing

95 Low-Power FPGA l Low Energy Embedded FPGA (Varghese George) l Test chip –8x8 CLB array –5 in - 3 out CLB –3-level interconnect hierarchy –4 mm² in 0.25 µm ST CMOS –0.8 and 1.5 V supply l Simulation Results –125 MHz Toggle Frequency –50 MHz 8-bit adder –energy 70 times lower than comparable Xilinx

96 An Energy-Efficient µP System Integrated dc-dc converter Dynamic Voltage Scaling (Trevor Pering & Tom Burd) µProc. Speed Lower speed, Lower voltage, Lower energy Before Idle After

97 Xtensa Configurable Processor l Xtensa (Tensilica,Inc) for embedded CPU –Configurability allows designer to keep “minimal” hardware overhead –ISA (compatible with 32 bit RISC) can be extended for software optimizations –Fully synthesizable –Complete HW/SW suite l VCC modeling for exploration –Requires mapping of “fuzzy” instructions of VCC processor model to real ISA –Requires multiple models depending on memory configuration –ISS simulation to validate accuracy of model (Vandana Prabhu)

98 Microprocessor Optimizations for Network Protocols l Implements transport layer on configurable processor –TDMA control and channel usage management l Upper layer of protocol is dominated by processor control flow –Memory routines, Branches, Procedure calls l Artifacts of code generation tools are significant l Excessively modular code introduces procedure calls l Uses dynamic memory allocation l Configurable processor l Increased size of register file l Customized instructions help datapath but not control (Kevin Camera & Tim Tuan) Total Execution Time: calloc, memcpy, other Memory Routines Efficient implementation at code generation and architecture levels!

99 Implementation Methodology for Reconfigurable Wireless Protocol l Changing granularity within protocol stack requires estimation tool for energy-efficient implementation l Software exploration on processors –Exploring Xtensa’s TIE l Hardware exploration on FPGA platforms –Optimal FPGA architecture –Alternately “Reconfigurable FSM” analogous to Pleiades approach for datapath kernels (Suetfei Li & Tim Tuan)

100 TCI - A First Generation PicoNode Tensilica Embedded Proc. Memory Sub-system Baseband Processing Configurable Logic (Physical Layer) Programmable Protocol Stack Sonics Backplane

101 The System-on-a-Chip Nightmare Bridge DMA CPU DSP Mem Ctrl. MPEG C IOO System Bus Peripheral Bus Control Wires Custom Interfaces The “Board-on-a-Chip” Approach Courtesy of Sonics, Inc.

102 The Communications Perspective DSP MPEG CPU DMA C MEM IO Example: “The Silicon Backplane” (Sonics, Inc.) Open Core Protocol™ SiliconBackplane Agent™ Communications-based Design Guaranteed Bandwidth Arbitration (Mike Sheets)

103 Summary l Design for low-energy impacts all stages of the design process — the earlier the better l Energy reduction requires clear communication and computation abstractions l Efficient and abstract modeling of energy at behavior and architecture level is crucial l Efficient hardware implementation of protocol stack l Beat the SoC monster!

104 Targeting Tiled Architectures in Design Exploration Lilian Bossuet 1, Wayne Burleson 2, Guy Gogniat 1, Vikas Anand 2, Andrew Laffely 2, Jean-Luc Philippe 1 1 LESTER Lab Université de Bretagne Sud Lorient, France {lilian.bossuet, guy.gogniat, jean- v-ubs.fr 2 Department of Electrical and Computer Engineering University of Massachusetts, Amherst, USA {burleson, vanand,

105 Design Space Exploration: Motivations l Design solutions for new telecommunication and multimedia applications targeting embedded systems l Optimization and reduction of SoC power consumption l Increase computing performance –Increase parallelism –Increase speed l Be flexible –Take into account run-time reconfiguration –Targeting multi-granularity (heterogeneous) architectures

106 Design Space Exploration: Flow l Progressive design space reduction: –iterative exploration –refinement of architecture model –increase of performance estimation accuracy l One level of abstraction for one level of estimation accuracy

107 Reconfigurable Architectures l Bridging the flexibility gap between ASICs and microprocessors [Hartenstein DATE 2001] l Energy-efficient solution for low-power programmable DSP [Rabaey ICASSP 1997, FPL 2000] l Run-time reconfigurable [Compton & Hauck 1999] l => A key ingredient for future silicon platforms [Schaumont et al. DAC 2001]

108 Design Space of Reconfigurable Architecture RECONFIGURABLE ARCHITECTURES (R-SOC) FINE GRAIN (FPGA) MULTI GRANULARITY (Heterogeneous) COARSE GRAIN (Systolic) Processor + Coprocessor Tile-Based Architecture Coarse Grain Coprocessor Fine Grain Coprocessor Island Topology Hierarchical Topology Linear Topology Hierarchical Topology Mesh Topology Chameleon REMARC Morphosys Pleiades Garp FIPSOC Triscend E5 Triscend A7 Xilinx Virtex-II Pro Altera Excalibur Atmel FPSLIC Xilinx Virtex Xilinx Spartan Atmel AT40K Lattice ispXPGA Altera Stratix Altera Apex Altera Cyclone Systolic Ring RaPiD PipeRench DART FPFA RAW CHESS MATRIX KressArray Systolix PulseDSP aSoC E-FPFA

109 A Target Architecture: aSoC l Adaptive System-on-a-Chip (aSoC) l Tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, Motion Estimation, Viterbi Decoder) l Mesh communication network controlled with statically determined communication schedule l A scalable architecture.

110 FPGA in System-on-a-Chip l Fast Time-To-Market l Post-Fabrication Customization –Broaden application domain –Run-time Reconfiguration –Bug Fixes –Upgrades l 10x-100x Worse: –Area –Performance –Power Mark L. Chang

111 aSoC Architecture l Heterogeneous Cores l Point-to-point connections l Communication Interface tile FPGA uProc MUL ctrl South Core West North East

112 aSoC Communications Interface Core Coreports Decoder Local Frequency & Voltage North to South & East Instruction Memory PC Controller North South East West LocalConfig. North South East West Inputs Outputs l Interface Crossbar –inter-tile transfer –tile to core transfer l Interconnect/Instruction Memory –contains instructions to configure the interface crossbar (cycle-by-cycle) l Interface Controller –selects the instruction l Coreports –data interface and storage for transfers with the tile IP core l Dynamic Voltage and Frequency Selection –Dynamic Power Management Interface Crossbar
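
The bullets above describe an interface crossbar that is reconfigured cycle by cycle from an instruction memory, with a controller selecting the instruction. A minimal Python sketch of that idea follows. The instruction encoding (a mapping from output port to input port), the `TileInterface` class, and the two-entry program are assumptions made for illustration, not the actual aSoC instruction format.

```python
class TileInterface:
    """Statically scheduled crossbar for one tile.

    `program` plays the role of the interconnect/instruction memory:
    one entry per cycle, each mapping an output port to the input
    port that drives it (hypothetical encoding).
    """

    def __init__(self, program):
        self.program = program
        self.pc = 0  # the interface controller's program counter

    def cycle(self, inputs):
        instr = self.program[self.pc]
        # Route each selected input to its output; unlisted outputs stay idle.
        outputs = {out: inputs.get(src) for out, src in instr.items()}
        # The schedule is fixed at compile time and simply repeats.
        self.pc = (self.pc + 1) % len(self.program)
        return outputs


# Two-cycle schedule: first broadcast North to South and East
# (as in the slide's decoder example), then deliver West to the core.
schedule = [
    {"south": "north", "east": "north"},
    {"core": "west"},
]
tile = TileInterface(schedule)

out0 = tile.cycle({"north": "a", "west": "b"})
out1 = tile.cycle({"west": "c"})
print(out0, out1, tile.pc)
```

Since no routing decision is taken at run time, latency is known in advance, which is the "fast and predictable" property claimed for static scheduling later in the deck.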

113 aSoC Exploration... l Type of tiles l Number of each type of tile l Placement of the tiles l Internal architecture of reconfigurable tiles (FPGA core) l Communication scheduling

114 Design Space Exploration: Goals l Goal: Rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC) in order to select the most efficient architecture for one or several applications l Takes place before architectural synthesis (algorithmic specification with high level abstraction language) l Estimations are based on a functional architecture model (generic, technology-independent) l Iterative exploration flow to progressively refine the architecture definition, from a coarse model to a dedicated model

115 Design Exploration Flow Targeting Tiled Architecture

116 Application Analysis l Use of algorithmic metrics and dedicated scheduling algorithms to highlight the target architectures l Algorithmic metrics: –Characterize the application orientation: Processing, Memory, Control –Characterize the application potential parallelism: Processing, Memory

117 Tile Exploration: with 3 steps l Projection: –Link between necessary resources (application) and available resources (tile) –Use of an allocation algorithm based on communication cost reduction l Composition: –Takes the function scheduling into account to estimate additional resources (register, mux, …) l Estimation: –performance interval computation (lower and upper bounds) –speed/resource utilization/power characterization
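
The slide does not say which allocation algorithm the projection step uses. As one possible reading of "allocation based on communication cost reduction", the Python sketch below places each function on a free tile of the required kind so as to minimize traffic volume times Manhattan distance to the functions already placed. The greedy strategy, the cost function, and all names and numbers are assumptions for illustration only.

```python
def manhattan(a, b):
    """Hop count between two tiles on a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def project(functions, tiles, traffic):
    """Greedy placement sketch (not the authors' algorithm).

    functions: name -> required tile kind, in placement order
    tiles:     (x, y) -> tile kind
    traffic:   (f, g) -> communication volume between functions f and g
    """
    placement = {}
    free = dict(tiles)
    for func, kind in functions.items():
        best, best_cost = None, None
        for pos, tile_kind in free.items():
            if tile_kind != kind:
                continue  # tile lacks the resources this function needs
            cost = 0
            for (a, b), volume in traffic.items():
                other = b if a == func else a if b == func else None
                if other in placement:
                    cost += volume * manhattan(pos, placement[other])
            if best_cost is None or cost < best_cost:
                best, best_cost = pos, cost
        if best is None:
            raise ValueError(f"no free {kind} tile for {func}")
        placement[func] = best
        del free[best]
    return placement


# Hypothetical 2x2 mesh and a three-function application.
tiles = {(0, 0): "risc", (0, 1): "dsp", (1, 0): "fpga", (1, 1): "dsp"}
functions = {"ctrl": "risc", "fir": "dsp", "fft": "dsp"}
traffic = {("ctrl", "fir"): 10, ("fir", "fft"): 5}

placement = project(functions, tiles, traffic)
print(placement)
```

Here `fir` lands on the DSP tile adjacent to the RISC tile because it exchanges the most data with `ctrl`; a real flow would also weigh the composition and estimation steps listed on the slide.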

118 aSoC Builder l Environment AppMapper l Partition and assignment –based on Run Time Estimation l Compilation –Communication Scheduling –Core compilation l Generate tiles configuration –Communications instructions –Bitstreams (for reconfigurable tile) –RISC instructions

119 aSoC Analysis l Use the results of previous steps –Functions scheduling –Tile allocation –Communication scheduling l Complete estimation of the proposed solution –Global execution time –Global power consumption –Total area

120 Power-Aware System on a Chip A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson University of Massachusetts Amherst Boston Area Architecture Conference 30 Jan 2003 {alaffely, jliang, tessier, moritz, This material is based upon work supported by the National Science Foundation under Grant No Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

121 Adaptive System-on-a-Chip l Tiled architecture with mesh interconnect –Point to point communication pipeline l Allows for heterogeneous cores –Differing sizes, clock rates, voltages l Low-overhead core interface for –On-chip bus substitute for streaming applications l Based on static scheduling –Fast and predictable µProc Tile Multiplier FPGA Multiplier ctrl South Core West North East Communication Interface

122 aSoC Implementation  technology Full custom

123 Some Results l 9 and 16 core systems tested for IIR, MPEG encoding and Image processing applications –~ 2 x the performance compared to CoreConnect bus Burst and Hierarchical –~ 1.5 x the performance of an oblivious routing network [1] (Dynamic routing) –Max speedup is 5 x [1] W. Dally and H. Aoki, “Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels”, IEEE Transactions on Parallel and Distributed Systems, April 1993

