
1 Digital Integrated Circuits A Design Perspective
System on a Chip Design

2 Application Specific Integrated Circuits: Introduction
Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab.

3 Contents
Why ASIC?
Introduction to System-on-Chip design
Hardware and software co-design
Low-power ASIC designs

4 Why ASIC – the design productivity gap grows!
Complexity increases 40% per year; design productivity increases only 15% per year. Integration of a PCB onto a single die.

5 Silicon in 2010
Die area: 2.5 x 2.5 cm; voltage: 0.6 V; technology: 0.07 µm

6 ASIC Principles
Value-added ASICs for huge-volume opportunities; standard parts for quick time-to-market applications
Economics of design: fast prototyping for low volume; custom design, labor intensive, for high volume
CAD tools needed to achieve the design strategies:
System-level design: concept to VHDL/C
Physical design: VHDL/C to silicon, timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)
Design strategies: hierarchy, regularity, modularity, locality

7 ASIC Design Strategies
Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters:
Performance specs: function, timing, speed, power
Size of die: manufacturing cost
Time to design: engineering cost and schedule
Ease of test generation & testability: engineering cost, manufacturing cost, schedule

8 ASIC Flow

9 Structured ASIC Designs
Hierarchy: subdivide the design into many levels of sub-modules
Regularity: subdivide to the maximum number of similar sub-modules at each level
Modularity: define sub-modules unambiguously, with well-defined interfaces
Locality: maximize local connections, keeping critical paths within module boundaries

10 ASIC Design Options
Programmable logic; programmable interconnect; reprogrammable gate arrays; sea-of-gates & gate-array design; standard-cell design; full-custom mask design; symbolic layout; process migration (retargeting designs)

11 ASIC Design Methodologies

12 Why SOC?
SOC specs come from system engineers rather than RTL descriptions.
SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architectures.
In SOC design, chips are assembled at the level of (reusable) IP blocks and IP interfaces rather than at the gate level.

13 CMOS density now allows complete System-on-a-chip Solutions
[Block diagram of a single-chip GSM phone: DSP core and µP core with RAM & ROM, DMA, S/P, and dedicated logic (Viterbi, equalizer, demod and sync, de-interleave & decoder, RPE-LTP speech coding, speech-quality enhancement, voice recognition, phone book, keypad interface, protocol control), plus an analog front end with digital down-conversion and A/D. Source: Brodersen, ICASSP '98]
We would also like to add an FPGA and reconfigurable interconnect. How do we design these chips?

14 Possible Single-Chip Radio Architectures
Software radio. GOAL: simplify the system design process. Seek architectures flexible enough that hardware and protocols can be designed independently. APPROACH: minimize the use of dedicated logic.
Universal radio. GOAL: maximize bandwidth efficiency and battery life. Seek architectures which perform complex algorithms very fast with minimal energy. APPROACH: minimize the use of programmable logic.
Why is SOC design so scary?

15 60 GHz SiGe Transceiver for Wireless LAN Applications
A low power 30 GHz LNA is designed as the front end of the receiver. Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique. The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB. The mixer consists of a single balanced Gilbert cell. A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF. 30 GHz receiver layout consisting of the LNA, mixer and VCO

16 Wideband CMOS LC VCO
A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured. This VCO uses a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V. The amplitude reference level is programmable by means of a 3-bit DAC. [VCO die photograph]
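To make the 2:1 tuning-range claim concrete, here is a small C sketch of the ideal LC-tank relation f = 1/(2π√(LC)): a 2:1 frequency range needs a 4:1 capacitance range. The inductance and capacitor values below are illustrative assumptions, not taken from this design.

```c
/* Back-of-the-envelope check of a 2:1 tuning range, assuming an ideal LC
 * tank: f = 1 / (2*pi*sqrt(L*C)).  All component values are assumed,
 * illustrative numbers, not the published design. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double L      = 5e-9;     /* assumed tank inductance: 5 nH */
    const double c_fix  = 1.5e-12;  /* assumed fixed tank capacitance: 1.5 pF */
    const double c_unit = 0.3e-12;  /* assumed unit switched capacitor: 0.3 pF */

    for (int code = 0; code < 16; code++) {        /* 4-bit capacitor array */
        double C = c_fix + code * c_unit;          /* 1.5 pF .. 6.0 pF (4:1) */
        double f = 1.0 / (2.0 * M_PI * sqrt(L * C));
        printf("code %2d: C = %.2f pF, f = %.2f GHz\n",
               code, C * 1e12, f * 1e-9);          /* ~1.84 GHz down to ~0.92 GHz */
    }
    return 0;
}
```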

17 A High Level View of an Industry Standard Design Flow
Source: Hitachi, Prof. R. W. Brodersen
Front-end: HDL entry → (good?) → synthesis. Back-end: floor-plan → place & route → physical verification (DRC & LVS) → done.
Problems with this flow:
Every step can loop back to every other step
Each step can take hours or days for a 100,000-line description
The HDL description contains no physical information
Different engineers handle the front-end and back-end design
How have semiconductor companies made this flow work?

18 A More Accurate Picture of the Standard Flow
Source: IBM Semiconductor, Prof. R. Newton
Architecture (10 months) → front-end → back-end (2 months) → fabrication (2 months)
Architecture: partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit. TOOLS: Matlab, C, SPW, (VCC). FREEZE the test vectors.
Front-end: enter HDL code which matches the test vectors. TOOLS: HDL simulators, Design Compiler. FREEZE the HDL code.
Back-end: create a floor-plan and tweak the tools until a successful mask layout is created. TOOLS: Design Compiler, floor-planners, placers, routers, clock-tree generators, physical verification.
How can we improve this flow?

19 Common Fabric for IP Blocks
Soft IP blocks are portable, but not as predictable as hard IP. Hard IP blocks are very predictable, since a specific physical implementation can be characterized, but are hard to port, since they are often tied to a specific process. A common fabric is required for both portability and predictability. Widely available: the Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.

20 Four main applications
Set-top box: mobile multimedia system, base station for the home local-area network
Digital PCTV: concurrent use of TV, 3D graphics, and Internet services
Set-top box LAN service: wireless home networks, multi-user wireless LAN
Navigation system: steer and control traffic and/or goods transportation

21 PC-Multimedia Applications

22 Types of System-on-a-Chip Designs

23 Physical gap
Timing closure problem: layout-driven logic and RT-level synthesis. Energy efficiency requires locality of computation and storage: a good match for stream-based processing of speech, images, and multimedia-system packets. Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.

24 Circular Y-Chart

25 SOC Co-Design Challenges
Current systems are complex and heterogeneous, containing many different types of components. Half of the chip can be filled with 200 low-power, RISC-like processors (ASIPs) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC. Computational power will come not from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.

26 Bridging the architectural gap
One million gates of reconfigurable logic, one million gates of hardwired logic; 50 GIPS for programmable components or 500 GIPS for dedicated hardware
Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
Trade-off: 100 MOPS/watt (microprocessor) vs. 100 GOPS/watt (hardwired)
Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)

27 Why Lower Power
Portable systems: long battery life, light weight, small form factor
IC priority list: power dissipation, cost, performance
Technology direction: reduced-voltage/power designs based on mature high-performance IC technology; high integration to minimize size, cost, and power while maintaining speed

28 Microprocessor Power Dissipation
[Chart: microprocessor power dissipation (W) vs. year, 1980–2000, rising from roughly 1–2 W for the i286 and i386 to 30–50 W for the Alpha 21164/21264, Pentium II 300, and Pentium III 500]

29 Levels for Low Power Design

30 Power-hungry Applications
Signal compression: HDTV standard, ADPCM, vector quantization, H.263, 2-D motion estimation, MPEG-2 storage management
Digital communications: shaping filters, equalizers, Viterbi decoders, Reed-Solomon decoders

31 New Computing Platforms
SOC power efficiency of more than 10 GOPS/W
Higher on-chip system integration: COTS 100 W vs. SOC 10 W (inter-chip capacitive loads, I/O buffers)
Speed & performance: shorter interconnect, fewer drivers, faster devices, more efficient processing architectures
Mixed-signal systems
Reuse of IP blocks
Multiprocessor, configurable computing
Domain-specific, combined memory-logic

32 Low Power Design Flow I
System level: system function → system-level specification → system-level partitioning and HW/SW allocation, guided by system-level power analysis
Behavioral level: behavioral description plus software functions and processor selection, with power-driven transformations and behavioral-level power analysis
Power-conscious high-level synthesis and optimization → RT-level design; the software side proceeds to software-level power analysis

33 Low Power Design Flow II
RT level: RT-level description → RTL mapping → logic synthesis and optimization, with gate-level power analysis
Gate/switch level: high-level synthesis and library mapping of datapath and controller; standard cells, processor, control and steering logic, memory, and macrocells analyzed at the switch level

34 Three Factors affecting Energy
Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
All-in-one approach (SOC): I/O pin and buffer reduction
Voltage-reducible hardware: 2-D pipelining (systolic arrays); SIMD parallel processing, useful for data with parallel structure; VLIW, a flexible approach

35 IBM’s PowerPC Lower Power Architecture
Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU); the FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
Low-power 3.3-volt design
Uses small complex instructions with smaller instruction length
IBM's PowerPC 603e is a RISC superscalar: CPI < 1; the 603e issues as many as three instructions per cycle
Low-power management: the 603e provides four software-controllable power-saving modes
Copper process with SOI
IBM's Blue Logic ASIC: the new design reduces power by a factor of 10

36 Power-Down Techniques
Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
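A minimal C sketch of this point, assuming the usual first-order model E_op ≈ C_eff · VDD² (the capacitance and operating points below are illustrative assumptions): slowing the clock alone lowers power but not the energy per task, while lowering the voltage along with the clock lowers both.

```c
/* Sketch of the power-down argument: clocking slower at the SAME voltage
 * saves power but not energy per task, while lowering voltage along with
 * the clock cuts the energy itself (E_op ~ C_eff * VDD^2).  The switched
 * capacitance and operating points are assumed, illustrative values. */
#include <stdio.h>

int main(void) {
    const double c_eff = 1e-9;   /* assumed switched capacitance per operation */
    const double ops   = 1e6;    /* fixed task: one million operations */

    struct { const char *mode; double vdd, f; } pt[] = {
        {"full speed      ", 3.3, 100e6},
        {"slow clock only ", 3.3,  25e6},   /* less power, same energy per task */
        {"slow clock + DVS", 1.2,  25e6},   /* less power AND less energy */
    };
    for (int i = 0; i < 3; i++) {
        double e_op   = c_eff * pt[i].vdd * pt[i].vdd;  /* J per operation */
        double power  = e_op * pt[i].f;                 /* W while active */
        double energy = e_op * ops;                     /* J for the whole task */
        printf("%s: %5.2f W, %7.4f J per 1M ops\n", pt[i].mode, power, energy);
    }
    return 0;
}
```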

37 Implementing Digital Systems

38 H/W and S/W Co-design

39 Three Co-Design Approaches
IFIP International Conference FORTE/PSTV '98, Nov. '98: N. S. Voros et al., "Hardware-software co-design of embedded systems using multiple formalisms for application development"
ASIP co-design: builds a specific programmable processor for an application and translates the application into software code; HW/SW partitioning includes the instruction-set design.
HW/SW synchronous system co-design: a software processor as master controller and a set of hardware accelerators as co-processors. Examples: Vulcan, Codes, Tosca, Cosyma.
HW/SW for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors; behavioral decomposition, process allocation, and communication transformation. Examples: CoWare (powerful), SIERA (reuse), Ptolemy (DSP).

40 Mixing H/W and S/W
Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
Counterpoint: from a design standpoint, it is the worst of both worlds.
Simulation: problems of verification and test become harder.
Interface: too many tools, too many interactions, too much heterogeneity.
Hardware/software partitioning is "AI-complete"! (MIT, Stanford: by analogy with "NP-complete".) A term used to describe problems in artificial intelligence, indicating that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.

41 Low power partitioning approach
Different hardware resources are invoked according to the instruction executed at a specific point in time: during execution of an add operation, the ALU and registers are used, but the multiplier is idle. Non-active resources still consume energy, since the corresponding circuits continue to switch. Calculate the wasted energy. Add application-specific cores and run them selectively: whenever one core is performing, all the other cores are shut down.
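As a rough sketch of that bookkeeping (all per-cycle energies below are assumed, illustrative numbers), comparing an ungated datapath, where the idle multiplier keeps switching, against ideal shutdown of non-active resources:

```c
/* Sketch of the wasted-energy calculation described above: while an ADD
 * executes, the idle multiplier still switches and burns energy unless it
 * is shut down (clock/power gated).  All energies are assumed values. */
#include <stdio.h>

int main(void) {
    /* assumed energy per cycle (pJ) when active vs. idle-but-clocked */
    const double alu_active = 5.0,  alu_idle = 1.0;
    const double mul_active = 20.0, mul_idle = 4.0;
    const long   add_cycles = 800, mul_cycles = 200;  /* illustrative mix */

    /* no gating: the idle unit keeps switching every cycle */
    double e_nogate = add_cycles * (alu_active + mul_idle)
                    + mul_cycles * (mul_active + alu_idle);
    /* ideal gating: only the active unit consumes energy */
    double e_gated  = add_cycles * alu_active + mul_cycles * mul_active;

    printf("no gating: %.0f pJ, gated: %.0f pJ, wasted: %.0f pJ (%.0f%%)\n",
           e_nogate, e_gated, e_nogate - e_gated,
           100.0 * (e_nogate - e_gated) / e_nogate);
    return 0;
}
```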

42 ASIP (Application Specific Instruction Processors) Design
Given a set of applications, determine the microarchitecture of the ASIP (i.e., the configuration of functional units in the datapath and the instruction set). To accurately evaluate the performance of the processor on a given application, one must compile the application program onto the processor datapath and simulate the object code. The microarchitecture of the processor is a design parameter!

43 ASIP Design Flow

44 Cross-Disciplinary nature
Software for low power: loop transformations lead to much higher temporal and spatial locality of data; code size becomes an important objective. Software will eventually become a part of the chip. Behavior-platform-compiler codesign: codesigned with C++ or Java, describing both the hardware and software implementation. Multidisciplinary system thinking is required for future designs (e.g., the Eindhoven Embedded Systems Institute).

45 VLSI Signal Processing Design Methodology
Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
Bit-serial, bit-parallel, and digit-serial architectures; carry-save architecture
Redundant and residue systems
Viterbi decoders, motion compensation, 2D filtering, and data transmission systems

46 Low Power DSP
DO-loops dominate DSP workloads: VSELP vocoder 83.4%, 2D 8x8 DCT 98.3%, LPC computation 98.0%
DO-loop power minimization ==> DSP power minimization
VSELP: Vector Sum Excited Linear Prediction; LPC: Linear Predictive Coding

47 Deep-Submicron Design Flows
Rapid evaluation of complex designs for area and performance
Timing convergence via estimated routing parasitics
In-place timing repair without resynthesis
Shorter design intervals, minimum iterations
Block-level design and place & route
Localized changes without disturbance
Integration of complex projects and design reuse

48 SOC CAD Companies
Synopsys (www.synopsys.com), Avant!, Cadence, Duet Tech, Escalade, LogicVision, Mentor Graphics, Palmchip, Sonics, Summit Design, Topdown Design Solutions, Xynetix Design Systems, Zuken-Redac

49 Design Technology for Low Power Radio Systems
Rhett Davis Dept. of EECS Univ. of Calif. Berkeley Good Morning, Ladies and Gentlemen, my name is Rhett Davis, and I come from the Berkeley Wireless Research Center at UC Berkeley. Today I will present to you our view of the big issues in the design of Low Power Radio Systems. We’re going to look mainly at the digital portion of the radio and how architectural choices affect performance and the difficulty of the design problem. I’ll also present some of the work I’ve been doing at the BWRC to improve performance and reduce the difficulty.

50 Domain of Interest Highly integrated system-on-a-chip solutions – SOCs. Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc. Primary computation is high-complexity dataflow with a relatively small amount of control. Now, to set the stage, let me point out the key aspects of the digital radio domain. This domain involves the design of highly integrated system-on-a-chip solutions. The application is wireless communications with associated signal processing, and that includes multimedia processing, compression, any computation that needs to happen in a portable device. When we look more closely at these applications, we see that they involve mostly high-complexity dataflow with a relatively small amount of control… which is an ideal task for a system-on-a-chip, right?

51 Why Systems-on-a-Chip - SOC ?
State-of-the-art CMOS can easily implement complete systems (what used to be on a board). A microprocessor core is only 1-2 mm2 (1-2% of the area of a $4 chip). Portability (size) is critical to meet the cost, power, and size requirements of future wireless systems. Chips will be required to support the complete application (wireless internet, multimedia). Dedicated stand-alone computation is replacing general-purpose processors as the semiconductor industry driver. So let's talk more about the system-on-a-chip design issues. CMOS technology has advanced to the point that a microprocessor core can take up only 1-2% of the total chip area, and since portability is key to the success of consumer electronics, it becomes attractive to use this capability to reduce the cost, power, and size of wireless systems. As such, chips are now expected to support a complete application rather than just being a building block. And one of the chief indicators of this is the fact that dedicated stand-alone computation is replacing general-purpose processors as the semiconductor industry driver.

52 Digital Cellular Market
Cellular phones: an example. Components: analog baseband, digital baseband (DSP + MCU), power management, small-signal RF, RF.
Digital cellular market (phones shipped): 48M, 86M, 162M, 260M, 435M units in successive years.
For example, look at how digital cell-phone sales have been going. Close to half a billion sold last year. And each one of these phones contains a digital baseband chip. So, what does this digital baseband chip look like? (Courtesy Mike McMahon, Texas Instruments)

53 (Courtesy Mike McMahon, Texas Instruments)
Cellular Phone Baseband SOC: ROM, MCU, DSP, gates, RAM, analog. It's exactly the kind of system-on-a-chip we've been talking about: two software-programmable processor cores on the same die, RAM and ROM, with some dedicated logic and a small analog portion. This seems like a very specialized chip, and yet Texas Instruments shipped on average 1 million of these chips per day last year. So, given that wireless baseband signal processing is a hot area, let's look at the big issues in the design of these chips. 2000+ phones on each 8" wafer (0.15 µm Leff). 1 million baseband chips per day! (Courtesy Mike McMahon, Texas Instruments)

54 Wireless System Design Issues
It is now possible to use CMOS to integrate all digital radio functions – but what is the “best” architectural way to use CMOS??? Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing What’s keeping us from achieving this in silicon? What can we do about it? First of all, since we are now able to integrate all of the digital radio functions on the same die, what is the “best” architecture to use for these systems? How much better could we be doing with silicon than we are right now? And secondly, when we look to the future, we see that there are many emerging algorithms which promise much more efficient use of bandwidth than today’s systems, but they are extremely complex. We foresee that portable wireless devices will soon need 100’s of GOPS in signal processing. Control and user interfaces are still important but aren’t getting that much more complex as far as silicon is concerned. So, what, today, is keeping us from achieving this rate of computation and what can we do about it?

55 Computational Efficiency Metrics
Definition: MOPS = millions of algorithmically defined arithmetic operations per second (e.g. multiply, add, shift) – in a GP processor, several instructions per "useful" operation
Figures of merit: MOPS/mW – energy efficiency (battery life); MOPS/mm2 – area efficiency (cost)
Optimization of these "efficiencies" is the basic goal, assuming functionality is met
So, first things first, how do we quantify the quality of a digital architecture? We'll begin with the rather common metric MOPS, mega-operations per second, which fits since we need to ensure that a certain amount of computation happens within real-time deadlines. For our purposes, however, we're going to count purely algorithmically defined arithmetic operations, such as multiply, add, and shift. That means that operations such as a move from memory or a jump-to-subroutine don't really count; they're just instructions which are part of the implementation and don't really matter in terms of the dedicated stand-alone computation that we're trying to accomplish. From here we can define figures of merit: energy efficiency in terms of MOPS/mW and area efficiency in terms of MOPS/mm2. So, what we call an "operation" may change from one algorithm to the next, but we still have a good basis for comparing architectures.
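These figures of merit are simple ratios; a small C sketch, with illustrative numbers loosely in the ranges quoted elsewhere in this talk (assumptions, not measured data):

```c
/* Figures of merit from the definition above: MOPS/mW (energy efficiency)
 * and MOPS/mm2 (area efficiency).  Sample throughput, power, and area
 * numbers are assumed for illustration. */
#include <stdio.h>

int main(void) {
    struct { const char *arch; double mops, mw, mm2; } a[] = {
        {"embedded uP ",   100, 1000, 50},   /* ~0.1 MOPS/mW */
        {"DSP         ",   500,  100, 30},   /* ~5   MOPS/mW */
        {"dedicated HW", 50000,  100, 50},   /* ~500 MOPS/mW */
    };
    for (int i = 0; i < 3; i++)
        printf("%s: %7.1f MOPS/mW, %8.1f MOPS/mm2\n",
               a[i].arch, a[i].mops / a[i].mw, a[i].mops / a[i].mm2);
    return 0;
}
```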

56 Energy-Efficiency of Architectures
[Chart: energy efficiency (MOPS/mW or MIPS/mW, log scale 0.1–1000) vs. flexibility (coverage). Embedded microprocessors: 0.1–1 MIPS/mW; ASIPs/DSPs: 1–10 MIPS/mW; reconfigurable processor/logic in between (potential unclear); direct-mapped dedicated hardware: 100–1000 MOPS/mW]
Now, when we compare the architectures, we see that they differ most significantly in terms of energy efficiency and flexibility. Embedded microprocessors are by far the least efficient, around 0.1 to 1 MOPS/mW. They're also the most flexible, and this flexibility seems to be directly related to their inefficiency. As we look at more specialized architectures such as programmable DSPs, we see that they cover a slightly smaller range of applications but are around an order of magnitude more efficient. If we eliminate all flexibility, however, and create an architecture specifically for the given algorithm, we can achieve an efficiency around 3 orders of magnitude better than microprocessors. We've theorized for a while that there must be a fourth style of architecture which fills this gap, being more specialized than DSPs and slightly more energy efficient, achieved perhaps with reconfigurable interconnect; however, our investigations into this area have had roughly the same efficiency as highly optimized DSPs. Now, this raises some interesting questions. First, what makes the programmable architectures so inefficient? Second, aren't processors always getting better? As we look to the future, is this efficiency gap going to get larger or smaller?

57 Software Processors: Energy Trends
[Chart: clock frequency (MHz, 50–300) vs. year, 1991–1996, for processors from the i386C-33 and i486 family through the Pentium Pro 200, MIPS R10000, UltraSparc-167, and HP PA8000]
When we look at the trends in software processors, we see that the primary means of performance increase has been raising the clock rate. This graph stops at 1996, but it's 2001 now, and we're in the GHz range. Now, there's a big problem with this approach: in order to increase the clock rate, you have to increase the supply voltage, and increasing the supply voltage means decreasing the energy efficiency (E ∝ C · VDD²). If we think of every algorithm as having an essential amount of capacitance that needs to be switched, then we'd like to switch it at the lowest possible voltage so that we get the longest battery life. Since processors have to increase their clock rate to achieve more performance, we can predict that the energy efficiency of software-programmable processors is continually decreasing relative to dedicated hardware.

58 Software Processors: Area Trends
Increasing clock rate results in a memory bottleneck – addressed by bringing memory on-chip. Area is increasingly dominated by memory – degrading MOPS/mm2. Compare a 16x16 multiplier (.05 mm2) with a DSP processor containing 1 multiplier (25 mm2). What about area? Here we come to another problem facing software-programmable processors: the memory bottleneck. The increases in clock rate have made memory access more critical to performance, and as a result, memory is being brought on chip. High-performance processors dedicate even more area to several levels of cache memory. To illustrate the problem, imagine that our algorithm were nothing more than a 100-tap FIR filter, which can be accomplished with an array of multipliers. We could lay out 100 multipliers in parallel and still be 1/5 the size of this DSP processor. So, one must ask, why time-multiplex to save area if the overhead is much greater than the savings? As we look to the future, we see the amount of memory on chip continuing to increase, further decreasing the area efficiency of these architectures. What's the problem? How did we get here? The problem is that the von Neumann architecture is now more than 50 years old. Hardware just isn't as expensive as it used to be. We need to look beyond single threads of execution if we want to exploit the capability of silicon.

59 Parallelism is the answer, but …
Not by putting Von Neumann processors in parallel and programming with a sequential language Attempts to do this have failed over and over again… The parallel computer compiler problem is very difficult Not by trying to capture parallelism at the instruction level Superscalar, VLIW, etc… are very inefficient Hardware can’t figure out the parallelism from a sequential language either The problem is the initial sequential description (e.g. C) which is poorly matched to highly parallel applications The answer is parallelism... BUT… not by putting several single-thread processors in parallel and continuing to program them with a sequential language. There have been attempts, but the parallel computer compiler problem is very difficult, and so parallel processor systems don’t really buy you anything unless they’re running multiple threads. What we’re seeing much more of these days are attempts to capture instruction-level parallelism, but this isn’t really the answer, either. Superscalar and VLIW architectures are still very energy inefficient, and, in fact, they don’t do a much better job of capturing parallelism because they’re just trying to do in hardware what we couldn’t do with the compilers up here. The problem is that this initial sequential description is poorly matched to parallel applications.

60 What is really happening…
Starting with a parallel algorithmic description, we re-enter it using a sequential description, and then try to rediscover the parallelism:
for (i = 0; i < num; i++) { a = a * c[i]; b[i] = sin(a * pi) + cos(a * pi); } outfil = b[i] * indata;
So, what's really happening? We start with a description of the dedicated stand-alone signal processing algorithm that we want to embed in our portable device. This description is essentially parallel, but we re-enter it using a sequential description so that we can simulate it easily. Then we try to re-discover the parallelism. This is a HARD problem. It's like trying to write a program to translate from English to Japanese and back again with no loss of information. And we have to ask, "Why are we doing this to ourselves?" Just so that we can waste orders of magnitude of energy and area? No. We'll get back to the reasons why we do it in a minute, but first let's look at how much better we could be doing. We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area??????

61 What can a fully parallel CMOS solution potentially do?
In .25 micron, a multiplier requires .05 mm2 and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy. Let's implement a 50 mm2, .25 micron chip using adders, registers, and multipliers. We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of the power goes into clocks. A 25 MHz clock (1 volt) gives ~50 GOPS at 100 mW: 500 MOPS/mW and 1000 MOPS/mm2. Let's look at how we would build a piece of dedicated hardware. We know that in 0.25 um a 16-bit multiplier requires about 0.05 square mm of area and uses about 7 pJ at 1 V. Adders and registers are about 10 times smaller and lower energy. If we limit ourselves to a 50 square mm chip and assume that half of the area goes to pads, routing, and that sort of thing, then we still have room for 2000 adders and registers and 200 multipliers. If we also assume that we have to add 1/3 of the total power back for the clock, then we can still get 50 GOPS with a 25 MHz clock, giving us a total power of 100 mW. This gives us 500 MOPS/mW and 1000 MOPS/square mm.
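The estimate can be re-derived from the slide's own numbers; this little C program just redoes the arithmetic (multiplier: 0.05 mm², 7 pJ/op; adders/registers 10x smaller and lower energy; clock takes 1/3 of total power):

```c
/* Re-derivation of the 50 GOPS / 100 mW estimate from the slide's own
 * 0.25 um, 1 V numbers.  Only the printf formatting is new here. */
#include <stdio.h>

int main(void) {
    const double f = 25e6;                        /* 25 MHz clock           */
    const int n_add = 2000, n_mul = 200;          /* datapath inventory     */
    const double e_add = 0.7e-12, e_mul = 7e-12;  /* J per operation        */
    const double a_add = 0.005,   a_mul = 0.05;   /* mm^2 per unit          */

    double area  = n_add * a_add + n_mul * a_mul;        /* 20 mm^2 < half  */
    double gops  = (n_add + n_mul) * f / 1e9;            /* ~55 GOPS        */
    double logic = (n_add * e_add + n_mul * e_mul) * f;  /* ~70 mW          */
    double total = logic * 1.5;                   /* clock = 1/3 of total   */

    printf("area %.0f mm^2, %.0f GOPS, %.0f mW\n", area, gops, total * 1e3);
    printf("%.0f MOPS/mW, %.0f MOPS/mm^2 (50 mm^2 die)\n",
           gops * 1e3 / (total * 1e3), gops * 1e3 / 50.0);
    return 0;
}
```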

62 Start with a parallel description of the algorithm…
So, how do we design it? We start with a parallel description of the algorithm…

63 Then directly map into hardware …
And then map it directly into hardware. Each operation becomes a functional unit, and each connection becomes hard-wired. It's really quite simple. [Diagram: functional units Mult1, Mult2, Mac1, Mac2, S reg, X reg, and add/sub/shift units]

64 Results in fully parallel solutions
Benchmarks: a 64-point FFT (energy per transform, nJ; transforms per second per unit area, Trans/ms/mm2) and a 16-state Viterbi decoder (energy per decoded bit, nJ; decode rate per unit area, kb/s/mm2):

Architecture            FFT nJ/transform   Viterbi nJ/bit   FFT Trans/ms/mm2   Viterbi kb/s/mm2
Direct-Mapped Hardware        1.78             0.022              2,200            200,000
FPGA                           683             5.5                  1.8                100
Low-Power DSP                  436             19.6                 4.3                 50
High-Performance DSP          1700             108                  10                 150

(numbers taken from vendor-published benchmarks) Orders of magnitude lower efficiency even for an optimized processor architecture. Now, we've been talking in generalities for a long time, but let's look at a specific example. Take a wireless networking standard which includes a 64-point FFT and a 16-state Viterbi decoder. We did this comparison one year ago using vendor-published benchmarks for the leading industry FPGA and high-performance and low-power programmable DSPs. We see that for these applications, the high-performance software processor is 3 to 4 orders of magnitude less energy efficient. The low-power architecture is better, but still way off compared to the dedicated hardware. The FPGA is kind of confusing: in one case it's worse than the low-power DSP, and in the other better. So, we're still somewhat confused about what hardware reconfigurability buys us. If we look now at area efficiency, we see that the high-performance software processor performs better than the FPGA and low-power DSP, but the direct-mapped hardware is still orders of magnitude better.

65 Reasons software solutions seem attractive
(1) Believed to reduce time-to-system-implementation (2) Provides flexibility (3) Locks the customers into an architecture they can't change (4) Difficulty in getting dedicated SOC chips designed. Are these good reasons??? So now, let's get back to the question, "Why do we use software?" In the past few years, we've talked a lot about the trade-off between hardware and software, and tools like VCC from Cadence allow us to explore this trade-off. But three orders of magnitude improvement is not a trade-off, so it's kind of silly to keep talking about it. Why do we use software? There seem to be 4 reasons. Number 1, it's believed to reduce time-to-system-implementation. Secondly, it provides flexibility. Third, it locks customers into an architecture they can't change. And fourth, designing systems-on-a-chip is extremely difficult. Are these good reasons?

66 (1) Believed to reduce time-to-system implementation
Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done) Limitations of software prototype often sets the ultimate limit of the system performance Software solutions can be shipped with bugs, not a real option for SOC Let’s start with the first reason. Yes, software does decrease the time to get the first prototype, but the time to get the fully working system is much longer. Software verification is another VERY HARD problem. This is mainly because the execution time of software isn’t deterministic. These processors have so much “state” that there are orders of magnitude more cases to verify. In comparison, dedicated hardware is deterministic and much easier to verify. Secondly, when we jump so quickly to a software prototype we eliminate the possibility of using dedicated logic and set a much lower limit to the ultimate system performance. It’s a high price to pay, it would be nice if we didn’t have this limit. Lastly, software solutions can be shipped with bugs. I don’t know… is this an advantage or a disadvantage? Less work for us, maybe? But one question is, do we measure the implementation time to the date it is shipped or the date the last bug fix is distributed? Something to think about, but in the end, it’s questionable whether software programmability actually reduces time to implementation.

67 (2) Need flexibility Software is not always flexible
Can be hard to verify Flexibility does not imply software programmability Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications Reconfiguration of interconnect can achieve flexibility with high levels of efficiency Now, let’s look at the second reason for using software: flexibility. Here I have two points that I’d like to make. First of all, software is not always flexible. Because software verification is so hard, once it is verified, it is very inflexible, and the hardware has very rigid constraints to make sure that the software execution is not affected. Secondly, flexibility does not imply software programmability. We’ve done many investigations into domain-specific architectures which address a range of applications and can be much more efficient.

68 Flexibility without software
[Charts: energy per transform vs. FFT size, and transforms per second per mm2 vs. FFT size, for FFTs from 16 to 512 points; all results scaled to 0.18 um]
For example, here are more numbers from the architectural comparison I showed earlier, here for FFTs ranging from 16 to 512 points. The top points show the FPGA and the high-performance and low-power programmable DSPs. The lowest points show the efficiency of purely dedicated hardware, while the points a factor of two less efficient are for a dedicated FFT processor which can be reconfigured for any size of FFT. This function-specific reconfigurable hardware is flexible but still two to three orders of magnitude more efficient in terms of energy and area than the other methods. So the big unanswered question is really, how much and what kinds of flexibility do we need?

69 Reasons software solutions seem attractive
(1) Believed to reduce time-to-system implementation (2) Provides flexibility (3) Locks the customers into an architecture they can’t change (4) Difficulty in getting dedicated SOC chips designed So, let’s go back to our list. Number 3… locks the customers into an architecture they can’t change. I can see how that’s good for some, but not necessarily good for everyone. However, I’m not a business major, so there’s not much I can say or do about this. So what I’ve been focusing on the last four years is this last point: the difficulty in getting SOC chips designed. So, let me now shift gears and talk about my work. In particular, what can we do to make it easier to design these dedicated chips?

70 Standard DSP-ASIC Design Flow
Flow: Algorithm Design (floating-point simulation, sequential) → System/Architecture Design (fixed-point simulation, mixed sequential & structural) → Hardware/Front-End Design (RTL code: integer only, structural with sequential leaf-cells) → Physical/Back-End Design (mask layout: single-wire connectivity with timing constraints)
Problems: three translations of design data; requirements for re-verification at each stage; uncontrolled looping when the pipeline stalls. Prohibitively long design time for direct-mapped architectures.
First let's ask, why is it so hard to design these systems? It's mainly because each new algorithm requires a new architecture and has a completely new design space to be explored. But it is extremely difficult to explore the design space with the standard ASIC flow. Hardware implementation of algorithms is typically broken into four phases handled by four different designers. Algorithm designers conceive the chip and deliver a specification to system designers, often in the form of a floating-point simulation, such as a bit-error-rate vs. signal-to-noise-ratio simulation. The system or architecture designers begin to add structure to this simulation and convert the data types from floating- to fixed-point. Hardware designers then write RTL code which satisfies this functionality, and physical designers map the RTL code to mask layout. This flow requires 3 translations of the design, expressing the functionality as gradually less sequential and more structural, with requirements for re-verification at each stage… and we just talked about how verification is the most difficult part. Thus, the new and unusual architectures that we're looking for tend to stall the flow, leading to uncontrolled looping back to earlier stages of the design process and extending the design time indefinitely. What can we do about this?

71 Direct Mapping Design Flow
Flow: Algorithm/System Simulation → (front-end RTL libraries + back-end floorplan) → Automated Flow → mask layout and performance estimates
Encourages iterations of layout; controls looping; reduces the flow to a single phase; depends on fast automation.
In order to realize the benefits of direct mapping, we need a flow more like this. We would like to explore the design space as thoroughly as possible, preferably from the algorithm or systems perspective, where the greatest advances are to be made. We would like to refine floating-point types to fixed-point types within the same description. We would like to be constrained by efficient libraries created by hardware/front-end designers and also to benefit from physical/back-end designers' understanding of interconnect. To achieve this goal, we capture the decisions made by each designer and fully automate the flow from these inputs to generate mask layout and performance estimates. This controls iteration by ensuring that designers do not have to continually translate design data. It encourages feedback of physical design issues to algorithm and system designers by allowing them to maintain ownership of design data at all times. It also improves interaction among the different designers by reducing the design flow to a single phase. However, we need a well-integrated design flow for this approach to work.

72 Déjà vu??? An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago. What happened? A decline of research into design methodologies. A single dominant flow has resulted: the Verilog-Synopsys-standard-cell flow. Lack of tool flows to support alternative styles of design. The research community lost access to technology and moved to highly sub-optimal processor and FPGA solutions. Now, let's take a step back for a moment. Doesn't this sound familiar? Isn't this just a rehashing of all the silicon compiler work that happened in the 80's? What happened? In the last 10 years, we've seen a decline of research into design methodologies, mostly due, I think, to the fact that processing technology wasn't advanced enough to make them worthwhile. And so a single, dominant industry-standard design flow has emerged, centered around Verilog and Synopsys Design Compiler-like tools. EDA companies make tools, not flows, and so there's no incentive to make a tool that doesn't fit into the standard flow. So we've been kind of stuck. As a result, you don't see many digital chips coming from the research community any more. Researchers have moved to processor and FPGA solutions which are highly sub-optimal. We'd like this to change. There's a lot of effort these days in the EDA community to separate behavior from implementation. But I don't agree with this approach. Why not, instead, seek the algorithms that have the most efficient silicon implementations?

73 Capturing Design Decisions
[Diagram: datapath with register files, MAC, adder, and shifter]
Categories: Function – basic input-output behavior; Signal – physical signals and types; Circuit – transistors; Floorplan – physical positions
Now, let's return to the discussion of our flow. In order to provide a well-automated design flow, we must be very specific about how we make and capture design decisions. Our goal with this flow was to get algorithm developers to create their simulations as dataflow graphs instead of writing C or Matlab code. These graphs capture the function decisions for the flow, specifying the basic input-output behavior of the design. Furthermore, we can capture signal decisions, such as word lengths, as properties associated with the edges of this graph. Circuit decisions governing the transistors used to implement each functional unit can be captured as properties of the nodes. We also wanted a companion floorplan view used solely to specify positions of functional units from the graph. Wouldn't it be great if we could then push a button and get layout and performance estimates for this design within the same day? How to get layout and performance estimates in a day?

74 Simplified View of the Flow
Flow: dataflow graph → elaborate → netlist (with macro library) → merge (with floorplan) → autoLayout → route → layout
New software: generation of netlists from a dataflow graph; merging of the floorplan from the last iteration; automatic routing and performance analysis; automation of the flow as a dependency graph (like the UNIX MAKE program)
We found that in order for this to happen, we had to write a lot of new software. First, we wrote software to translate dataflow graphs to an electronic design format. This "elaboration" step must also invoke macro generators and stitch everything into a netlist of routable objects. Next, we wrote programs to merge placement information from the floorplan views with the netlist, creating autoLayout views. Physical designers modify these autoLayout views and save them as floorplans for the next iteration. We also wrote programs which automatically route, verify, and characterize the design. Lastly, we described our design flow as a dependency graph and created a tool much like the UNIX MAKE program to automate it.
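A minimal C sketch of that MAKE-style automation idea, assuming timestamp-driven rebuilds over the elaborate → merge → route chain; the node names and times are illustrative, not the actual tool:

```c
/* Minimal sketch of MAKE-style dependency evaluation: each flow artifact
 * is a node whose output is stale when any input is newer.  Names and
 * timestamps are illustrative assumptions. */
#include <stdio.h>

typedef struct Node {
    const char *name;        /* flow artifact, e.g. "netlist" */
    long mtime;              /* last-modified time (arbitrary units) */
    struct Node *deps[4];    /* inputs this artifact is built from */
    int ndeps;
} Node;

/* Rebuild a node if any dependency is newer; return its up-to-date mtime. */
long build(Node *n, long now) {
    long newest = 0;
    for (int i = 0; i < n->ndeps; i++) {
        long t = build(n->deps[i], now);
        if (t > newest) newest = t;
    }
    if (n->ndeps > 0 && newest > n->mtime) {
        printf("rebuilding %s\n", n->name);  /* run elaborate/merge/route here */
        n->mtime = now;
    }
    return n->mtime;
}

int main(void) {
    Node dfg   = {"dataflow graph", 100, {0}, 0};
    Node fplan = {"floorplan",       90, {0}, 0};
    Node net   = {"netlist",         50, {&dfg}, 1};          /* elaborate */
    Node alay  = {"autoLayout",      40, {&net, &fplan}, 2};  /* merge */
    Node lay   = {"layout",          30, {&alay}, 1};         /* route */
    build(&lay, 200);  /* edits to the graph or floorplan propagate down */
    return 0;
}
```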

75 Time-Multiplexed FIR Filter
Why Simulink? Why did we choose Simulink as our dataflow graph editor? In short, Simulink is an easy sell to algorithm developers, primarily because it is closely integrated with Matlab, which is popular among algorithm experts and system designers. If we're serious about getting our algorithm people to create layout, then we need to make it as easy as possible for them to approach our environment. Furthermore, we have successfully modeled a variety of digital datapaths with Simulink, as well as co-simulated them with models of analog circuits. Thus, we know that Simulink is sufficient for the kinds of wireless baseband algorithms which are of most interest to us. This simple example of a time-multiplexed FIR filter illustrates how we use Simulink: a multiply-accumulate block is fed by an input data stream and by tap coefficients from an SRAM and control logic. Simulink is an easy sell to algorithm developers; closely integrated with the popular system design tool Matlab; successfully models digital and analog circuits.
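For concreteness, here is a behavioral C sketch of such a time-multiplexed FIR: one MAC reused across all taps, coefficients read from an SRAM-like array, and the accumulator reset for each output sample. The tap count and coefficient values are assumptions for illustration:

```c
/* Behavioral sketch of a time-multiplexed FIR: one multiply-accumulate
 * unit serves all taps, taking NTAPS clock cycles per output sample.
 * Tap count and coefficients are illustrative assumptions. */
#include <stdio.h>

#define NTAPS 4

static const int coef[NTAPS] = {1, 2, 2, 1};  /* coefficient "SRAM" */
static int delay_line[NTAPS];                 /* recent inputs, [0] = newest */

int fir_step(int x) {
    for (int i = NTAPS - 1; i > 0; i--)       /* shift in the new sample */
        delay_line[i] = delay_line[i - 1];
    delay_line[0] = x;

    int acc = 0;                              /* MAC accumulator reset */
    for (int t = 0; t < NTAPS; t++)           /* NTAPS cycles on one MAC */
        acc += coef[t] * delay_line[t];
    return acc;
}

int main(void) {
    int in[] = {1, 0, 0, 0, 1, 1};
    for (int n = 0; n < 6; n++)
        printf("y[%d] = %d\n", n, fir_step(in[n]));
    return 0;
}
```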

76 Modeling Datapath Logic
Discrete-Time (cycle accurate) Fixed-Point Types (bit true) Completely specify function and signal decisions No need for RTL Here’s an example of how we model datapath logic. It’s a detail of the multiply-accumulate block. Here we see a multiplier, adder, register, and multiplexor. Notice that the register is modeled as a unit delay, and that means that we’re using a discrete-time computation model for this dataflow graph. We use discrete time because it can be made cycle-accurate with respect to the hardware and is thus easy to verify. We also have the option of replacing floating-point blocks with fixed-point blocks so that it can be made bit-true with respect to the hardware, again for verification purposes. The goal here is to specify functionality and signals in the dataflow graph so completely that there is never any need for a complete RTL simulation of the system. Multiply / Accumulate

77 Modeling Control Logic
Extended finite state-machine editor Co-simulation with dataflow graph New Software: Stateflow-VHDL translator No need for RTL Now, we need to do something about control logic. It’s difficult to model control with data-path primitives, so we need some sort of finite state-machine primitive in our data-flow graph. Simulink offers one called Stateflow. The example shown here shows the address generator and MAC reset control logic for the FIR filter shown earlier. This chart shows an initial loop to load tap coefficients, with successive loops reading the coefficients and resetting the accumulator. To integrate Stateflow with our environment, we wrote a new program to translate these charts into synthesizable VHDL code. Again, we want this translation to happen in a way that we never need to look at the RTL code. Address Generator / MAC Reset
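A hedged C sketch of the kind of control chart described here: an initial loop that loads the tap coefficients, then successive loops that sweep the coefficient addresses and pulse the MAC reset. The state names and tap count are illustrative assumptions, not the actual Stateflow chart:

```c
/* Sketch of the address-generator / MAC-reset FSM: a LOAD phase writes
 * coefficients, then a RUN phase cycles the addresses and resets the
 * accumulator at tap 0.  States and tap count are assumptions. */
#include <stdio.h>

#define NTAPS 4

enum state { LOAD, RUN };

int main(void) {
    enum state s = LOAD;
    int addr = 0;
    for (int cycle = 0; cycle < 12; cycle++) {
        const char *name = (s == LOAD) ? "LOAD" : "RUN ";
        int mac_reset = 0;
        switch (s) {
        case LOAD:                     /* initial loop: write coefficients */
            if (++addr == NTAPS) { addr = 0; s = RUN; }
            break;
        case RUN:                      /* successive loops: read coefficients */
            mac_reset = (addr == 0);   /* pulse accumulator reset at tap 0 */
            if (++addr == NTAPS) addr = 0;
            break;
        }
        printf("cycle %2d: %s mac_reset=%d\n", cycle, name, mac_reset);
    }
    return 0;
}
```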

78 Specifying Circuit Decisions
[Diagram: time-multiplexed FIR filter with macro choices: Stateflow-VHDL translator, RTL code, datapath-generator code, custom module, black box]
So, we've talked about function and signal decisions; let's talk about circuit decisions. We specify circuit decisions by embedding macro choices in Simulink. The dataflow portion of the design can be specified as RTL code, datapath-generator code, or even custom modules created by ASIC designers. This means that we will have a second functional description, and we must verify that it is equivalent to the dataflow-graph model. At present, this is done with cross-check simulations, though we could develop more formal methods of checking, or even translate from our dataflow graph into code for a datapath generator. Using our Stateflow-VHDL translator, control logic can be synthesized like any RTL macro. Finally, RAMs can be specified as black-box macros for vendor-supplied memories. Macro choices embedded in the dataflow graph; cross-check simulations required.

79 Hierarchy Hardened Progressively
System-level design environment with hard-macro characterization libraries: estimate performance (power, area, delay); lay out and characterize each new hard macro; macro characterization saved for fast estimates. Each level of hierarchy becomes a new hard macro; higher levels of hierarchy are adjusted; when the top level of the hierarchy is hardened, the design is done. One of the most important parts of our environment is the ability to quickly route and characterize these macros, turning them into "hard macros" and storing the characterization info for fast estimates later on. The performance of these "hard macros" is well understood, meaning that our fast estimates for the entire system will have less variance as we harden more of the macros. The higher levels of the hierarchy can then be adjusted to compensate for any incorrect assumptions. In using this flow, we find that the design process tends to progress by routing and characterizing the entire hierarchy from the bottom up. We call this process "hierarchy hardening", and once the entire hierarchy is hardened, the design is done.

80 Capturing Floorplan Decisions
Parallel Pipelined FIR Filter. Now, let's talk about floorplan decisions. We capture floorplan decisions with commercial physical design tools. The initial skeleton floorplan is generated by the automated flow. Physical designers then edit the floorplan, placing instances and boundary pins. To facilitate the merging of placement information on each iteration, we constrain the instance names in the floorplan to match the block names in the dataflow graph, as illustrated in this example floorplan for a parallel pipelined FIR filter, which I will discuss more later. Furthermore, having this companion floorplan view allows us to improve our fast performance estimates by predicting the parasitics of global wires with Manhattan distances. Commercial physical design tools used; instance names in the floorplan match the dataflow graph; placements merged on each iteration; Manhattan distance can be used for parasitic estimates.

81 Reduced Impact of Interconnect
[Chart: ratio of wire delay to FO4 inverter delay vs. supply voltage, 0.18 um]
Long wires can be modeled as lumped capacitances. So now, we've talked about the basic approach of our flow, but we haven't said anything that addresses the difficulty of deep sub-micron design, so let's take a minute to mention some things that make design in this domain easier. First of all, the impact of interconnect on the design process is reduced at low supply voltages, due to slow transistors. This example shows the ratio of RC wire delay to the logic delay of a fanout-of-4 inverter in 0.18 um for a range of supply voltages. The graph shows that at the industry-standard supply voltage, for a 5 mm metal-6 wire, the ratio is about 1/2; that is, the wire delay is comparable to the logic delay. As we drop the supply voltage, however, the RC wire delay does not change while the logic delay rises considerably. This means that at low voltages, long wires can be accurately modeled as lumped capacitances, making it easier to predict delay from simple Manhattan distances measured in the floorplan. We also don't have to worry about issues such as repeater insertion, making it considerably easier to design large systems.
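A small C sketch of the voltage dependence being described, assuming an alpha-power-law gate delay (t ∝ VDD/(VDD−VT)^α) against a voltage-independent Elmore wire delay; all device and wire parameters are assumed 0.18 µm ballpark values, not the actual data behind the chart:

```c
/* Sketch: RC wire delay is voltage independent, FO4 logic delay grows as
 * VDD drops, so the wire/logic ratio shrinks at low voltage.  All
 * parameters are assumed, illustrative values. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double R = 150.0, C = 0.75e-12; /* assumed 5 mm metal-6: 150 ohm, 0.75 pF */
    const double t_wire = 0.4 * R * C;    /* Elmore-style distributed RC, ~45 ps */
    const double vt = 0.45, alpha = 1.3;  /* assumed threshold, velocity-sat. index */
    const double fo4_nom = 90e-12;        /* assumed FO4 delay at VDD = 1.8 V */
    const double k = fo4_nom * pow(1.8 - vt, alpha) / 1.8;

    for (int i = 0; i < 4; i++) {         /* sweep VDD: 1.8, 1.5, 1.2, 0.9 V */
        double vdd = 1.8 - 0.3 * i;
        double fo4 = k * vdd / pow(vdd - vt, alpha);  /* logic slows at low VDD */
        printf("VDD = %.1f V: FO4 = %4.0f ps, wire = %2.0f ps, ratio = %.2f\n",
               vdd, fo4 * 1e12, t_wire * 1e12, t_wire / fo4);
    }
    return 0;
}
```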

82 Race-Immune Clock Tree Synthesis
Race-immunity condition: t_skew(max) < t_clk-Q(min) − t_hold(max)
Example hierarchical clock tree synthesis in 0.18 um at VDD = 1 V: race margin = 580 ps; logic power: 21 mW. (Stage count, sink count, skew, and clock power figures were given on the slide.) Demonstrated on a 600k-transistor design.
Also, low supply voltages allow us to pursue race-immune clock-tree synthesis. We define the quantity "race margin" for a given technology to be the minimum clock-to-Q delay of all clocked elements minus the maximum hold time. In a 0.18 um technology at 1 volt and typical process parameters, for example, we have a race margin of 580 ps. If the absolute skew of the clock tree is less than the race margin, then no checking for short paths is required to prevent races. This simplifies the design flow immensely. It's common practice today to insert chains of inverters in all logic paths to increase the delay and make the design more race tolerant. Not only does this waste area, power, and cycle time, but it's hard to do correctly. It's much easier to automate a design flow if you can insert a race-immune clock tree. It isn't really easy to do that, either, but it's not impossible, and so far we've demonstrated it on a 600k-transistor design. It is, however, the one part of our flow which is not yet fully automated.
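The race-immunity test itself is one comparison; here is a tiny C sketch using the slide's 580 ps margin (the clock-to-Q and hold numbers that produce it, and the skew value, are assumptions):

```c
/* The race-immunity check from this slide: a clock tree is safe against
 * short-path races when max skew < min clock-to-Q minus max hold time.
 * The 580 ps margin is from the slide; the timing inputs are assumed. */
#include <stdio.h>

int main(void) {
    const double t_clkq_min = 0.70e-9;  /* assumed min clock-to-Q, 0.18 um @ 1 V */
    const double t_hold_max = 0.12e-9;  /* assumed max hold time */
    const double margin = t_clkq_min - t_hold_max;  /* race margin: 580 ps */
    const double skew   = 0.45e-9;      /* assumed max skew of synthesized tree */

    printf("race margin = %.0f ps, skew = %.0f ps -> %s\n",
           margin * 1e12, skew * 1e12,
           skew < margin ? "race-immune, no short-path padding needed"
                         : "must check/pad short paths");
    return 0;
}
```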

83 Example 1: Macro Hardening
Parallel pipelined FIR filter, hardened flat:
Execution time: 3 hours (elaborate/route), 9 hours (characterization)
Disk space: 180 MB (elaborate/route), 1.5 GB (characterization)
240k transistors, 21k cells; critical path delay 18.0 ns (1 V, PathMill); 13.0 mW at 25 MHz (1 V, PowerMill); 1.4 mm2 in 0.25 um
So now, let me show you a few examples of what we have done with this flow. Here is an example of hardening a flat version of the parallel pipelined FIR filter shown earlier. This design is a decimation filter for a 12-bit, 200 MHz sigma-delta converter, parallelized to provide 8 streams at 25 MHz, each offset by 1/8th of a cycle. In addition to routing, the flow launches EPIC PowerMill and PathMill analyses to estimate the power and critical-path delay of the layout. The execution time and disk space required by the flow demonstrate the fundamental cost of push-button automation in the design process. The flow took 3 hours and 180 MB of disk space to complete the elaboration and routing portions of the flow. Characterization of the blocks took considerably more time and space: 9 hours and 1.5 GB. Macros of 1 to 10 thousand transistors run through the flow considerably faster, needing less than 30 minutes and less than 100 MB of disk space in total. Most time and disk space is spent on extraction and power simulation.

84 Example 2: Test Chip 300k transistors 0.25 mm 1.0 V 25 MHz 6.8 mm2
14 mW 2 phase clock 3 layers of P&R hierarchy Here is a die photo of the first test-chip made with our flow. It is a version of the parallel pipelined FIR filter shown in the last slide using the hierarchical floorplan shown earlier. The design has 3 layers of routing hierarchy. This was more layers than necessary, but it allowed us to exercise our hierarchical place & route flow more thoroughly. The chip has 300,000 transistors and consumes 14 mW at 25 MHz. This chip demonstrates our entire methodology except for race-immune clock-tree synthesis. A 2-phase clock was used to avoid race problems. The low ratio of transistors to area is due to the excessive detail of the floorplanning. More recent versions of the flow allow selective flattening of the hierarchy to improve density. Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz SD)

85 TDMA Baseband Receiver
Carrier detection, frequency estimation, rotate & correlate, control. 600k transistors, 0.18 um, 1.0 V, 25 MHz, 1.1 mm2, 21 mW, single-phase clock, 5 clock domains, 2 layers of P&R hierarchy. A complete baseband receiver chip which exercises the flow more thoroughly is scheduled to tape out very soon. The design includes 3 Module Compiler macros: a carrier detection macro to recover coarse timing, a frequency estimation block to achieve fine timing, and a rotate-and-correlate block with a phase-locked loop to coherently provide soft symbols. The design also features control logic generated from Stateflow. This design achieves a greater density than the test chip by having only 2 layers of routing hierarchy. It also has a single-phase clock with 5 domains, allowing the clock to be switched off when not in use to save power.

86 Conclusions
Direct-mapped hardware is the most efficient use of silicon
Direct-mapped hardware can be easier to design and verify than embedded hardware/software systems
Don't translate design data, refine it
Design with dataflow graphs, not sequential code
Design-flow automation speeds up design-space exploration

87 Embedded Processor Architectures and (Re)Configurable Computing
Vandana Prabhu Professor Jan M. Rabaey Jan 10, 2000

88 Pico Radio Architecture
FPGA, embedded µP, dedicated FSM, dedicated DSP, reconfigurable datapath

89 Reconfigurable Computing: Merging Efficiency and Versatility
Spatially programmed connection of processing elements. “Hardware” customized to specifics of problem. Direct map of problem specific dataflow, control. Circuits “adapted” as problem requirements change.

90 Matching Computation and Architecture
[Diagram: convolution mapped onto an architecture with address generator, memory, MAC, and control processor]
Two architectural models: sequential control + data-driven. Two models of computation: communicating processes + dataflow. Example: convolution. Details are on the Pleiades poster and in Marlene Wan's software methodology for this architecture.

91 Implementation Fabrics for Data Processing
Workload: 300 million multiplications/sec and 357 million add-subs/sec

Fabric          Adaptive Pilot Correlator    Digital Baseband Receiver
DSP             460 mW, 1089 mm2             1500 mW, 3600 mm2
Direct Mapped   3 mW, 1.3 mm2                10 mW, 5 mm2
Pleiades        18.49 mW, 5.44 mm2           62.33 mW, 21.34 mm2

16 Mmacs/mW!
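As a sanity check of the "16 Mmacs/mW" headline, dividing the pilot-correlator workload by the Pleiades power (both numbers from the slide):

```c
/* 300 Mmacs/s over 18.49 mW gives roughly 16 million MACs per second
 * per mW.  Both inputs are taken directly from the slide. */
#include <stdio.h>

int main(void) {
    const double macs_per_s = 300e6;   /* adaptive pilot correlator workload */
    const double power_mw   = 18.49;   /* Pleiades power for that workload */
    printf("%.1f Mmacs/s per mW\n", macs_per_s / 1e6 / power_mw);
    return 0;
}
```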

92 Software Methodology Flow
Flow (Marlene Wan): algorithms, area & timing constraints, and µprocessor/accelerator (PDA) models → kernel detection and behavioral transformations → estimation/exploration for low power, using premapped power & timing estimates of various kernel implementations → partitioning into an executable intermediate form → software compilation, reconfigurable-hardware mapping, interface code generation, and interconnect optimization

93 Maia: Reconfigurable Baseband Processor for Wireless
0.25 um technology: 4.5 mm x 6 mm, 1.2 million transistors, 40 MHz at 1 V, 1 mW VCELP voice coder
Hardware: 1 ARM-8, 8 SRAMs & 8 AGPs, 2 MACs, 2 ALUs, 2 in-ports and 2 out-ports, 14x8 FPGA

94 Implementation Fabrics for Protocols
A protocol = an extended FSM. Implementations of the Intercom TDMA MAC compared: ASIC (1 V, 0.25 µm CMOS process); FPGA (1.5 V, 0.25 µm CMOS low-energy FPGA); ARM8 (1 V, 25 MHz processor; n = 13,000). Energy ratio: >> 400. Idea: exploit the model of computation, concurrent finite state machines communicating through message passing. [Figure: extended-FSM example with buffer memory, a 2x16 slot-set table, slot/packet/RACH signals, and idle/write/read/slotset control states]

95 Low-Power FPGA
Low Energy Embedded FPGA (Varghese George). Test chip: 8x8 CLB array, 5-input/3-output CLBs, 3-level interconnect hierarchy, 4 mm² in 0.25 µm ST CMOS, 0.8 and 1.5 V supplies. Simulation results: 125 MHz toggle frequency; a 50 MHz 8-bit adder with energy 70 times lower than a comparable Xilinx part.

96 An Energy-Efficient µP System
Dynamic Voltage Scaling (Trevor Pering & Tom Burd): an integrated dc-dc converter enables lower speed, lower voltage, and lower energy. Tom Burd's core, shrunk to 0.25 µm, is used in the Maia chip; more details in his presentation. [Figure: µProc speed before and after voltage scaling during idle periods]
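First-order CMOS scaling explains the win: dynamic energy per operation goes as CV², while the achievable clock frequency falls off much more slowly, so running slower at a lower voltage saves energy whenever the workload tolerates it. A back-of-the-envelope sketch; the alpha-power exponent, threshold voltage, and constants are assumptions, not measurements from this work:

```python
def energy_per_op(c_eff, vdd):
    # Dynamic switching energy per operation: E ~ C * Vdd^2
    return c_eff * vdd ** 2

def max_freq(vdd, vt=0.4, k=100e6):
    # Alpha-power delay model (alpha ~ 2): f_max ~ k * (Vdd - Vt)^2 / Vdd
    return k * (vdd - vt) ** 2 / vdd

for vdd in (1.5, 1.0, 0.7):
    print(f"Vdd={vdd} V: E/op={energy_per_op(1e-12, vdd):.2e} J, "
          f"f_max={max_freq(vdd)/1e6:.0f} MHz")
```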

97 Xtensa Configurable Processor
Xtensa (Tensilica, Inc.) for the embedded CPU. Configurability lets the designer keep hardware overhead "minimal". The ISA (compatible with a 32-bit RISC) can be extended for software optimizations. Fully synthesizable, with a complete HW/SW suite. VCC modeling for exploration: requires mapping the "fuzzy" instructions of the VCC processor model onto the real ISA; requires multiple models depending on the memory configuration; ISS simulation validates the accuracy of the model. (Vandana Prabhu)

98 Microprocessor Optimizations for Network Protocols
Implements the transport layer on a configurable processor: TDMA control and channel usage management. The upper layer of the protocol is dominated by processor control flow: memory routines, branches, and procedure calls. Artifacts of the code generation tools are significant: excessively modular code introduces procedure calls and uses dynamic memory allocation. Configurable processor: increased register-file size; customized instructions help the datapath but not control. [Chart: total execution time broken down into calloc, memcpy, and other memory routines] Refer to Kevin Camera/Tim Tuan's poster for more details. Efficient implementation at the code-generation and architecture levels! (Kevin Camera & Tim Tuan)

99 Implementation Methodology for Reconfigurable Wireless Protocol
Changing granularity within the protocol stack requires an estimation tool for energy-efficient implementation. Software exploration on processors: exploring Xtensa's TIE. Hardware exploration on FPGA platforms: finding an optimal FPGA architecture. Alternatively, a "reconfigurable FSM" analogous to the Pleiades approach for datapath kernels. (Suetfei Li & Tim Tuan)

100 TCI - A First Generation PicoNode
Tensilica embedded processor, memory sub-system, Sonics backplane. VCC exploration now includes models for the Tensilica CPU and the Sonics interconnect, and we will derive mappings from behavior to architecture; this is what we envision the TCI system to be. Communication is as important as computation, and power, delay, and area are the metrics we will track in the explorations. Some interesting experiments map different layers of the protocol stack onto different platforms, from the configurable Tensilica processor with its extensible ISA to reconfigurable FPGA platforms. A low-energy embedded FPGA design has been demonstrated, and the Pleiades architecture template would be used for the reconfigurable datapath. Programmable protocol stack; baseband processing; configurable logic (physical layer).

101 The System-on-a-Chip Nightmare
[Block diagram: CPU, DSP, DMA, MPEG, memory controller, and I/O blocks glued together through a system bus, a peripheral bus, a bridge, control wires, and custom interfaces] The "board-on-a-chip" approach. Refer to Rhett's design-flow effort toward an integrated CAD tool flow for design and verification. (Courtesy of Sonics, Inc.)

102 The Communications Perspective
(Mike Sheets) Example: the Silicon Backplane (Sonics, Inc.). DSP, MPEG, CPU, DMA, memory, and I/O cores attach through SiliconBackplane Agents™ using the Open Core Protocol™, with guaranteed-bandwidth arbitration. Communications-based design.

103 Summary
Design for low energy impacts all stages of the design process; the earlier, the better. Energy reduction requires clear communication and computation abstractions. Efficient and abstract modeling of energy at the behavior and architecture levels is crucial, as is efficient hardware implementation of the protocol stack. Beat the SoC monster! It is important to build a library of hardware modules: low-power microprocessors (both soft and hard IP), reconfigurable datapath and interconnect, embedded FPGA and interconnect. Not just design models, but the ability to abstract them into higher-level exploration environments like VCC (as has been done for the Tensilica processor and the Sonics interconnect). Work on fur

104 Targeting Tiled Architectures in Design Exploration
Lilian Bossuet (1), Wayne Burleson (2), Guy Gogniat (1), Vikas Anand (2), Andrew Laffely (2), Jean-Luc Philippe (1). (1) LESTER Lab, Université de Bretagne Sud, Lorient, France. (2) Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, USA.

105 Design Space Exploration: Motivations
Design solutions for new telecommunication and multimedia applications targeting embedded systems. Optimization and reduction of SoC power consumption. Increase computing performance: increase parallelism and speed. Be flexible: take run-time reconfiguration into account. Target multi-granularity (heterogeneous) architectures.

106 Design Space Exploration: Flow
Progressive design space reduction: iterative exploration, refinement of the architecture model, and increasing accuracy of the performance estimation. One level of abstraction for one level of estimation accuracy.
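The loop this implies can be sketched abstractly. Everything here (the function names, the pruning policy) is hypothetical scaffolding around the idea, not the authors' tool:

```python
def explore(candidates, estimate, levels=3, keep=0.5):
    """Progressive design-space reduction: re-estimate the survivors at
    each successively more accurate (and more expensive) abstraction
    level, pruning as accuracy grows."""
    for level in range(levels):
        ranked = sorted(candidates, key=lambda arch: estimate(arch, level))
        candidates = ranked[: max(1, int(len(ranked) * keep))]
    return candidates[0]  # most promising architecture
```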

107 Reconfigurable Architectures
Bridging the flexibility gap between ASICs and microprocessors [Hartenstein, DATE 2001]. Energy efficient: a solution for low-power programmable DSP [Rabaey, ICASSP 1997; FPL 2000]. Run-time reconfigurable [Compton & Hauck, 1999]. => A key ingredient for future silicon platforms [Schaumont et al., DAC 2001].

108 Design Space of Reconfigurable Architecture
RECONFIGURABLE ARCHITECTURES (R-SOC):
Fine grain (FPGA): island topology or hierarchical topology.
Coarse grain (systolic): mesh, linear, or hierarchical topology.
Multi-granularity (heterogeneous): processor + coprocessor (coarse-grain or fine-grain coprocessor), or tile-based architecture.
Examples: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA, Altera Stratix, Altera Apex, Altera Cyclone, Chameleon, REMARC, MorphoSys, Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC, aSoC, E-FPFA, RAW, CHESS, MATRIX, KressArray, Systolix PulseDSP, Systolic Ring, RaPiD, PipeRench, DART, FPFA.

109 A Target Architecture: aSoC
Adaptive System-on-a-Chip (aSoC): a tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, motion estimation, Viterbi decoder). The mesh communication network is controlled with a statically determined communication schedule. A scalable architecture.

110 FPGA in System-on-a-Chip
Pros: fast time-to-market, post-fabrication customization (broadens the application domain), run-time reconfiguration, bug fixes, upgrades. Cons: 10x-100x worse area, performance, and power. (Mark L. Chang)

111 aSoC Architecture
Heterogeneous cores (µProc, MUL, FPGA) with point-to-point connections. [Figure: a tile pairs a core with its communication interface, with ctrl and north/south/east/west ports]

112 aSoC Communications Interface
Interface Crossbar: inter-tile transfers and tile-to-core transfers. Interconnect/Instruction Memory: holds the instructions that configure the interface crossbar, cycle by cycle. Interface Controller: selects the instruction. Coreports: data interface and storage for transfers with the tile's IP core. Dynamic voltage and frequency selection; dynamic power management. [Figure: core and coreports around the interface crossbar, with north/south/east/west inputs and outputs, local config, decoder, controller with PC and instruction memory, and frequency & voltage control]
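Because the crossbar setting is fetched from the instruction memory cycle by cycle, the interface behaves like a tiny statically programmed sequencer. A behavioral sketch; the schedule contents and port names are invented for illustration:

```python
# Hypothetical 4-entry schedule for one tile's interface: each entry is
# a (source port, destination port) pair applied on that cycle.
SCHEDULE = [
    ("core",  "north"),  # cycle 0: send core output to the north neighbour
    ("south", "core"),   # cycle 1: deliver south traffic into the core
    ("west",  "east"),   # cycle 2: pure pass-through traffic
    (None,    None),     # cycle 3: idle; the interface can be power-managed
]

def crossbar_setting(cycle):
    # The controller's "PC" is just the cycle count modulo schedule length.
    return SCHEDULE[cycle % len(SCHEDULE)]

print(crossbar_setting(6))  # ('west', 'east')
```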

113 aSoC Exploration
Exploration parameters: type of tiles, number of each type of tile, placement of the tiles, internal architecture of the reconfigurable tiles (FPGA core), and communication scheduling.

114 Design Space Exploration: Goals
Goal: rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC), in order to select the most efficient architecture for one or several applications. Takes place before architectural synthesis (algorithmic specification in a high-level language). Estimations are based on a functional architecture model (generic and technology-independent). An iterative exploration flow progressively refines the architecture definition, from a coarse model to a dedicated model.

115 Design Exploration Flow Targeting Tiled Architecture
[Flow diagram: the application's C specification is parsed into HCDFG graphs (one per function); together with a model of the aSoC architecture and its tiles, an analysis step and a tile-exploration step produce per-tile performance, cost, and occupation results; a builder then performs static communication scheduling to yield the final aSoC architecture model (THF/HF models)]

116 Application Analysis
Use of algorithmic metrics and dedicated scheduling algorithms to highlight suitable target architectures. The algorithmic metrics characterize the application's orientation (processing, memory, control) and its potential parallelism; a sketch follows.
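A minimal sketch of such an orientation metric, assuming the HCDFG analysis yields per-category operation counts; the category names and input format are assumptions:

```python
def orientation(op_counts):
    """Normalize per-category operation counts into an orientation profile."""
    total = sum(op_counts.values())
    return {cat: round(n / total, 2) for cat, n in op_counts.items()}

# A processing-oriented application:
print(orientation({"processing": 700, "memory": 250, "control": 50}))
# {'processing': 0.7, 'memory': 0.25, 'control': 0.05}
```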

117 Tile Exploration: Three Steps
Projection: links the necessary resources (application) to the available resources (tile), using an allocation algorithm based on communication-cost reduction. Composition: takes the function scheduling into account to estimate additional resources (registers, muxes, …). Estimation: computes a performance interval (lower and upper bounds) and characterizes speed, resource utilization, and power; a sketch follows.
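For the estimation step, classic scheduling bounds convey the flavor of a performance interval. A rough sketch under simplifying assumptions (unit-latency operations, a single resource class), not the authors' estimator:

```python
import math

def performance_interval(n_ops, n_resources, critical_path):
    """Lower bound: the schedule can beat neither the critical path nor
    the resource bound. Upper bound: a Graham-style list-scheduling bound."""
    lower = max(critical_path, math.ceil(n_ops / n_resources))
    upper = critical_path + math.ceil((n_ops - critical_path) / n_resources)
    return lower, upper

print(performance_interval(n_ops=24, n_resources=4, critical_path=5))  # (6, 10)
```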

118 aSoC Builder Environment
AppMapper: partition and assignment based on run-time estimation. Compilation: communication scheduling and core compilation. Generates the tile configurations: communication instructions, bitstreams (for reconfigurable tiles), and RISC instructions.

119 aSoC Analysis
Uses the results of the previous steps (function scheduling, tile allocation, communication scheduling) to produce a complete estimation of the proposed solution: global execution time, global power consumption, and total area; see the sketch below.
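A sketch of that aggregation; the per-tile record format is invented, and a real tool would also fold in the communication network's contribution:

```python
def global_estimates(tiles):
    """Combine per-tile results into chip-level figures: tiles operate
    concurrently, so time is set by the slowest tile, while power and
    area simply accumulate."""
    return {
        "exec_time": max(t["time"] for t in tiles),
        "power":     sum(t["power"] for t in tiles),
        "area":      sum(t["area"] for t in tiles),
    }

print(global_estimates([
    {"time": 1.2e-3, "power": 12e-3, "area": 2.5},  # e.g. a RISC tile
    {"time": 0.8e-3, "power": 30e-3, "area": 4.0},  # e.g. an FPGA tile
]))
```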

120 Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson; University of Massachusetts, Amherst. Boston Area Architecture Conference, 30 Jan 2003. This material is based upon work supported by the National Science Foundation under Grant No. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

121 Adaptive System-on-a-Chip
Tiled architecture with mesh interconnect (µProc, multiplier, and FPGA tiles) and a point-to-point communication pipeline. Allows for heterogeneous cores with differing sizes, clock rates, and voltages. Low-overhead core interface; an on-chip bus substitute for streaming applications. Based on static scheduling: fast and predictable. [Figure: tile with core, communication interface, ctrl, and north/south/east/west ports]

122 aSoC Implementation
Full custom, 0.18 µm technology; dimensions 2500λ x 3000λ.

123 Some Results
9- and 16-core systems were tested with IIR, MPEG encoding, and image processing applications. ~2x the performance of the CoreConnect bus (burst and hierarchical); ~1.5x the performance of an oblivious routing network [1] (dynamic routing); the maximum speedup is 5x.
[1] W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multi-computer Networks Using Virtual Routing", IEEE Transactions on Parallel and Distributed Systems, April 1993.

