Frank Vahid, UC Riverside 1 New Opportunities with Platform Based Design Frank Vahid Associate Professor Dept. of Computer Science and Engineering University.

Presentation transcript:

Frank Vahid, UC Riverside
New Opportunities with Platform Based Design
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend

How Much is Enough?
Perhaps a bit small
Reasonably sized
Probably plenty big
More than typically necessary
Very few people could use this

How Much Custom Logic is Enough?
[Figure labels: IC package, IC]
1993: ~1 million logic transistors. Perhaps a bit small.
[year not preserved]: ~5-8 million logic transistors. Reasonably sized.
[year and count not preserved]: ~ million logic transistors. Probably plenty big.
[year and count not preserved]: ~ million logic transistors. More than typically necessary.
[year not preserved]: >1 BILLION logic transistors (1993: 1 M). Perhaps very few people could design this.
Point of diminishing returns
  32-bit ARM: ~30K
  MPEG decoder: ~1M
Other examples: fast cars (> 100 mph), high-res digital cameras (> 4M), disk space, even IC performance

Very Few Companies Can Design High-End ICs
Designer productivity growing at slower rate
  1981: 100 designer months → ~$1M
  2002: 30,000 designer months → ~$300M
[Figure: design productivity gap. Axes: logic transistors per chip (in millions) and productivity (K transistors per staff-month). Curves: IC capacity and productivity, with the gap between them.]
Source: ITRS'99

Meanwhile, ICs Themselves are Costlier
And take longer to fabricate
While market windows are shrinking
Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron

Tech:        [four process nodes; values not preserved]
NRE:         $40k      $100k     $350k     $1,000k
Turnaround:  42 days   49 days   56 days   76 days
Market:      $3.5B     $6B       $12B      $18B

Source: DAC'01 panel on embedded programmable logic

Summarizing So Far...
* Transistors are less scarce: ICs are big enough, fast enough
* ICs take more time and money to design and fabricate, while market windows are shrinking
Designers: buy pre-fabricated system-level ICs (platforms)

Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific standard product
  Domain-specific pre-fabricated IC, e.g., digital camera IC
ASIC: application specific IC
ASSP revenue > ASIC
ASSP design starts > ASIC
  Unique IC design
  Ignores quantity of same IC
ASIC design starts decreasing
  Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01

Will High End ICs Still be Made?
YES
  The point is that mainstream designers likely won't be making them
  Becoming out of reach of mainstream designers
Very high volume or very high cost products
Platforms are one such product (high volume)
  Need to be highly configurable to adapt to different applications and constraints

Configurable Platform Design: Cache
[Figure: pre-fabricated platform IC (a pre-designed system-level architecture) containing uP, L1 cache, DSP, JPEG decoder, peripherals, and FPGA]
ARM920T: caches consume half of total power (Segars 01)
M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)

Best Cache Architecture for Embedded Systems
Not clear
  Huge variety among popular embedded processors
What's the best associativity, line size, total size?

Cache Associativity
Direct mapped cache (1-way set associative)
  Certain bits “index” into cache
  Remaining “tag” bits compared
Set associative cache
  Multiple “ways”
  Fewer index bits, more tag bits, simultaneous comparisons
  More expensive, but better hit rate
[Figure: addresses A, B, C, D mapped into a direct mapped cache, where a conflict occurs, and into a 2-way set associative cache; tag and index fields labeled]
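The index/tag split and the conflict behaviour described on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the LRU replacement policy, and the toy four-line cache are illustrative choices, not details taken from the slides.

```python
from collections import OrderedDict

def split_address(addr, num_sets, line_size):
    """Split an address into (tag, index, offset); sizes must be powers of two."""
    offset_bits = line_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

def simulate(addresses, total_lines, ways, line_size=1):
    """Count hits for a cache with `total_lines` lines organized into `ways` ways.

    Replacement is LRU (an assumption; the slides do not name a policy).
    """
    num_sets = total_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        tag, index, _ = split_address(addr, num_sets, line_size)
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)         # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict least recently used line
            lines[tag] = True
    return hits

# Two addresses that share index bits but differ in tag bits.
trace = [0b0000, 0b0100] * 3
direct_mapped_hits = simulate(trace, total_lines=4, ways=1)
two_way_hits = simulate(trace, total_lines=4, ways=2)
```

With this trace the two addresses evict each other in the direct mapped organization, while the 2-way organization of the same total size keeps both resident.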

Cache Associativity
Reduces miss rate, thus improving performance
Impact on power and energy? (Energy = Power * Time)

Associativity is Costly
Associativity improves hit rate, but at the cost of more power per access
Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only)]
[Figure: energy per access for 8 Kbyte cache]
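The trade-off posed here can be made concrete with a toy energy model: total energy is the per-access energy times the number of accesses, plus a per-miss energy times the number of misses. Every number below is hypothetical and chosen only to show that the answer depends on the program; none of it comes from the slides' measurements.

```python
def cache_energy(accesses, misses, energy_per_access, energy_per_miss):
    """Total energy in arbitrary units: every access pays the lookup cost,
    and every miss additionally pays the cost of going off-chip."""
    return accesses * energy_per_access + misses * energy_per_miss

# Hypothetical costs: a 4-way lookup costs twice a direct mapped lookup.
E_ACCESS_DM, E_ACCESS_4WAY, E_MISS = 1, 2, 20

# Hypothetical program A: few conflicts, so associativity barely helps.
dm_a = cache_energy(1000, 20, E_ACCESS_DM, E_MISS)         # 1400
four_way_a = cache_energy(1000, 15, E_ACCESS_4WAY, E_MISS)  # 2300

# Hypothetical program B: many conflicts in the direct mapped cache.
dm_b = cache_energy(1000, 150, E_ACCESS_DM, E_MISS)         # 4000
four_way_b = cache_energy(1000, 20, E_ACCESS_4WAY, E_MISS)  # 2400
```

Under these made-up costs the direct mapped cache wins for program A and the 4-way cache wins for program B, which is the kind of per-program variation the slide asks about.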

Associativity and Energy
Best performing cache is not always lowest energy
[Figure annotation: significantly poorer energy]

Associativity Dilemma
Direct mapped cache
  Good hit rate on most examples
  Low power per access
  But poor hit rate on some examples
  High power due to many misses
Four-way set-associative cache
  Good hit rate on nearly all examples
  But high power per access
  Overkill for most examples, thus wasting energy
Dilemma: design for the average or worst case?
Obviously not a clear choice

Our Solution: Configurable Cache
Can be configured as 4, 2, or 1 way
  Ways can be concatenated
Size can also be configured
  By shutting down ways
  Saves static power (leakage)
[Figure: cache ways with one address bit labeled "This bit selects the way"]
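One way to picture way concatenation is that address bits which would otherwise belong to the tag are used to choose which way (or pair of ways) is probed, so fewer ways are activated per access while total capacity stays the same. The sketch below assumes a cache with four ways, 64 sets per way, and 32-byte lines, with address bits a11 and a12 acting as way selectors; the field widths follow the labels on the circuit slide, but the pairing of ways and the function name are assumptions made for illustration.

```python
LINE_OFFSET_BITS = 5  # a4..a0 (assumed 32-byte line)
INDEX_BITS = 6        # a10..a5 (assumed 64 sets per way)

def ways_probed(addr, config_ways):
    """Return (set index, list of ways activated) for one access.

    config_ways is the configured associativity: 4, 2, or 1.
    Which physical ways are paired in 2-way mode is an assumption.
    """
    index = (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if config_ways == 4:
        ways = [0, 1, 2, 3]            # all ways compared in parallel
    elif config_ways == 2:
        ways = [2 * a11, 2 * a11 + 1]  # a11 selects one pair of ways
    elif config_ways == 1:
        ways = [2 * a12 + a11]         # a12 and a11 select a single way
    else:
        raise ValueError("config_ways must be 4, 2, or 1")
    return index, ways

example_addr = 0x1820  # a12 = 1, a11 = 1, set index = 1
four_way = ways_probed(example_addr, 4)
two_way = ways_probed(example_addr, 2)
one_way = ways_probed(example_addr, 1)
```

In this model a direct mapped configuration activates one way per access instead of four, which is where the per-access energy saving would come from.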

Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
[Figure: configuration circuit and cache datapath. Address fields: tag address a31 to a13, bits a12 and a11, index a10 to a5, line offset a4 to a0. Configuration registers reg0 and reg1 produce signals c0 to c3. Other labels: 6x64, tag part, data array, bitline, column mux, sense amps, address mux driver, data output, critical path.]
Small area and performance overhead

Configurable Cache Experiments
Motorola PowerStone benchmark g3fax: way concatenation outperforms both the 4-way and the direct mapped cache
Considered programs from Powerstone, MediaBench, and Spec2000
Configurable cache with both way concatenation and way shutdown was best on average
  And it was superior on every benchmark
[Figure note: 100% = 4-way conventional cache]

Configurable Cache Experiments: Line Size Too
Best line size also differs per example
Our cache can be configured for a line of 16, 32, or 64 bytes
  64 is usually best, but 16 is much better in a couple of cases
A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy
[Figure notes: 100% = 4-way conventional cache; csb = concatenate plus shutdown cache]
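With three knobs (way shutdown, way concatenation, line size) the configuration space is small enough to enumerate. The sketch below lists the combinations and picks the one with the lowest energy according to a caller-supplied function. The class and function names are hypothetical, the rule that associativity cannot exceed the number of powered ways is an assumption, and the toy energy function is invented; in practice the energy figure would come from simulation or measurement of the target program.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CacheConfig:
    active_ways: int    # ways left powered on (way shutdown)
    associativity: int  # 4, 2, or 1 (way concatenation)
    line_size: int      # bytes

def enumerate_configs():
    """List every combination, assuming associativity <= powered ways."""
    configs = []
    for active, assoc, line in product((4, 2, 1), (4, 2, 1), (16, 32, 64)):
        if assoc <= active:
            configs.append(CacheConfig(active, assoc, line))
    return configs

def best_config(energy_of):
    """Exhaustively pick the configuration with the lowest reported energy."""
    return min(enumerate_configs(), key=energy_of)

# Toy stand-in for a real energy estimate (purely illustrative).
def toy_energy(cfg):
    return 10 * cfg.active_ways + 3 * cfg.associativity + abs(cfg.line_size - 32)

chosen = best_config(toy_energy)
```

Exhaustive search is a reasonable starting point at this size (18 configurations under the stated assumption); a larger design space would need a heuristic.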

Configurable Platform Use
Platforms increasingly come with on-chip FPGA
Can we use that FPGA to improve software performance and energy?
[Figure: pre-fabricated platform IC containing uP, L1 cache, DSP, JPEG decoder, peripherals, and FPGA; uP and FPGA called out]

Commercial Single-Chip Microprocessor/FPGA Platforms
Triscend E5: based on 8-bit 8051 CISC core
  10 Dhrystone MIPS at 40 MHz
  60 kbytes on-chip RAM
  Up to 40K logic gates
  Cost only about $4 (in volume)
[Figure: Triscend E5 chip, showing configurable logic, 8051 processor plus other peripherals, and memory]

Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC (Field-Programmable System-Level IC)
  Based on AVR 8-bit RISC core
  20 Dhrystone MIPS
  5k-40k configurable logic gates
  On-chip RAM (20-36Kb) and EEPROM
  $5-$10
Courtesy of Atmel

Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip
  Based on ARM7 32-bit RISC processor
  54 Dhrystone MIPS at 60 MHz
  Up to 40k logic gates
  On-chip cache and RAM
  $10-$20 in volume
Courtesy of Triscend

Single-Chip Microprocessor/FPGA Platforms
Altera's Excalibur EPXA 10
  ARM (922T) hard core
  ~200 Dhrystone MIPS at ~200 MHz
  Devices range from ~200k to ~2 million programmable logic gates
Source:

Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex II Pro
  PowerPC based: 420 Dhrystone MIPS at 300 MHz
  1 to 4 PowerPCs
  4 to 16 gigabit transceivers
  12 to 216 multipliers
  3,000 to 50,000 logic cells
  200k to 4M bits RAM
  204 to 852 I/O
  $100-$500 (>25,000 units)
[Figure: chip showing config. logic, PowerPCs, and up to 16 serial transceivers (622 Mbps to Gbps)]
Courtesy of Xilinx

Single-Chip Microprocessor/FPGA Platforms
Why wouldn't future microprocessor chips include some amount of on-chip FPGA?

Single-Chip Microprocessor/FPGA Platforms
Lots of silicon area taken up by configurable logic
  As discussed earlier, less of an issue every year
Smaller area doesn't necessarily mean higher yield (lower costs) any more
  Previously could pack more die onto a wafer
  But die are becoming pad (pin) limited in nanoscale technologies
Configurable logic typically used for peripherals, glue logic, etc.
  We have investigated another use...

Software Improvements using On-Chip Configurable Logic
Partitioned critical software loops onto on-chip FPGA for several benchmarks
  Most time spent in one or two loops
Extensive simulated results for 8051 and MIPS
  For Powerstone (PS), MediaBench (MB), and Netbench (NB)
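Because most execution time sits in one or two loops, the overall gain from moving a loop to hardware can be estimated with Amdahl's law. This is a generic textbook estimate, not the method used to produce the results in this talk, and the loop fraction and loop speedup below are hypothetical values.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: only the fraction of time spent in the moved loop speeds up."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)

# Hypothetical program: 80% of time in the critical loop,
# and the loop runs 10x faster once implemented in the FPGA.
estimate = overall_speedup(0.8, 10.0)

# Upper bound if the loop took no time at all in hardware.
bound = overall_speedup(0.8, float("inf"))
```

The bound shows why the remaining software limits the gain: with 80% of the time in the loop, no hardware implementation can exceed roughly a 5x overall speedup.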

Software Improvements using On-Chip Configurable Logic
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

Speedup Gained with Relatively Few Gates
Created several partitioned versions of each benchmark
Most speedup gained with first 20,000 gates
  Surprisingly few gates
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM), 2002
Stitt and Vahid, IEEE Design and Test, Dec
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)

Software Improvements using On-Chip Configurable Logic: Verified through Physical Measurement
Performed physical measurements on Triscend A7 and E5 devices
Similar results (even a bit better)
[Figure: Triscend A7 development board, with the A7 IC labeled]

Other Types of Configurability
Microprocessor (other researchers)
  VLIW configurations
  Voltage scaling
Peripherals
  e.g., JPEG decoder with different precisions
Bus topology
Etc.
[Figure: platform IC containing uP, L1 cache, DSP, JPEG decoder, peripherals, and FPGA]

Conclusions
Trend is away from semi-custom IC fabrication
  Pressures encourage buying pre-fabricated platforms
Platforms must be highly configurable
  To be useful for a variety of applications, and hence mass produced
We have discussed
  Software speedup/energy benefits of on-chip configurable logic: 3x speedups and 34% energy savings with only ~10,000 gates
  Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
Designing highly-configurable platforms, and facilitating their use with good exploration tools, can help enable platform-based design
For more information, see http://