New Opportunities with Platform Based Design. Frank Vahid, UC Riverside.

1 Frank Vahid, UC Riverside 1 New Opportunities with Platform Based Design Frank Vahid Associate Professor Dept. of Computer Science and Engineering University of California, Riverside Also with the Center for Embedded Computer Systems at UC Irvine http://www.cs.ucr.edu/~vahid This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend

2 How Much is Enough?

3 How Much is Enough? Perhaps a bit small.

4 How Much is Enough? Reasonably sized.

5 How Much is Enough? Probably plenty big.

6 How Much is Enough? More than typically necessary.

7 How Much is Enough? Very few people could use this.

8 How Much Custom Logic is Enough? 1993: ~1 million logic transistors. Perhaps a bit small. [Figure labels: IC package, IC]

9 How Much Custom Logic is Enough? 1996: ~5-8 million logic transistors. Reasonably sized.

10 How Much Custom Logic is Enough? 1999: ~10-50 million logic transistors. Probably plenty big.

11 How Much Custom Logic is Enough? 2002: ~100-200 million logic transistors. More than typically necessary.

12 How Much Custom Logic is Enough? 2008: >1 BILLION logic transistors (1993: 1 M). Perhaps very few people could design this. Point of diminishing returns: 32-bit ARM: ~30K; MPEG dcd: ~1M. Other examples: fast cars (> 100 mph), high res digital cameras (> 4M), disk space, even IC performance.

13 Very Few Companies Can Design High-End ICs. Designer productivity growing at slower rate. 1981: 100 designer months → ~$1M. 2002: 30,000 designer months → ~$300M. [Figure: design productivity gap, 1981 to 2009; IC capacity in logic transistors per chip (in millions) plotted against productivity in (K) trans./staff-mo.] Source: ITRS'99

14 Meanwhile, ICs Themselves are Costlier. And take longer to fabricate, while market windows are shrinking. Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron.
Tech (micron):  0.8      0.35     0.18     0.13
NRE:            $40k     $100k    $350k    $1,000k
Turnaround:     42 days  49 days  56 days  76 days
Market:         $3.5B    $6B      $12B     $18B
Source: DAC'01 panel on embedded programmable logic

15 Summarizing So Far... Transistors are less scarce: ICs are big enough, fast enough. ICs take more time and money to design and fabricate, while market windows are shrinking. Designers: buy pre-fabricated system-level ICs (platforms).

16 Trend Towards Pre-Fabricated Platforms: ASSPs. ASSP: application specific standard product, a domain-specific pre-fabricated IC (e.g., digital camera IC). ASIC: application specific IC. ASSP revenue > ASIC. ASSP design starts > ASIC (a design start counts a unique IC design and ignores quantity of same IC). ASIC design starts decreasing, due to strong benefits of using pre-fabricated devices. Source: Gartner/Dataquest, September '01

17 Will High End ICs Still be Made? YES. The point is that mainstream designers likely won't be making them (becoming out of reach of mainstream designers). Very high volume or very high cost products. Platforms are one such product: high volume. Need to be highly configurable to adapt to different applications and constraints.

18 Configurable Platform Design: Cache. ARM920T: caches consume half of total power (Segars 01). M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99). [Figure: pre-fabricated platform (a pre-designed system-level architecture) IC containing uP, L1 cache, DSP, JPEG dcd, peripherals, FPGA; L1 cache highlighted]

19 Best Cache Architecture for Embedded Systems: not clear. Huge variety among popular embedded processors. What's the best associativity, line size, total size?

20 Cache Associativity. Direct mapped cache (1-way set associative): certain bits "index" into cache; remaining "tag" bits compared. Set associative cache: multiple "ways"; fewer index bits, more tag bits, simultaneous comparisons; more expensive, but better hit rate. [Figure: four addresses A, B, C, D (00 0 000, 01 0 000, 10 0 000, 11 0 000) share index 0 and conflict in the direct mapped cache, shown holding D with tag 11; the 2-way set associative cache holds both D (tag 110) and C (tag 100)]
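
The tag/index split on slide 20 can be sketched in a few lines of Python. This is only an illustration: the function name `split_address` and the toy geometry (a 16-byte cache with 8-byte lines, chosen to match the 6-bit addresses in the slide figure) are assumptions, not part of the presentation.

```python
def split_address(address, cache_bytes, line_bytes, ways):
    """Split an address into (tag, index, offset).

    Hypothetical helper; all sizes are assumed to be powers of two.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset = address & (line_bytes - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# Addresses C and D from the slide figure (10 0 000 and 11 0 000).
C = 0b100000
D = 0b110000

# Direct mapped (1 way): one index bit, two tag bits.
# Both addresses land on index 0 with different tags, so they conflict.
print(split_address(C, 16, 8, ways=1))  # (2, 0, 0), i.e. tag 0b10
print(split_address(D, 16, 8, ways=1))  # (3, 0, 0), i.e. tag 0b11

# 2-way set associative: no index bits, three tag bits.
# Both addresses fit in the single set, one per way.
print(split_address(C, 16, 8, ways=2))  # (4, 0, 0), i.e. tag 0b100
print(split_address(D, 16, 8, ways=2))  # (6, 0, 0), i.e. tag 0b110
```

Doubling the associativity at a fixed cache size moves one bit from the index to the tag, which is the "fewer index bits, more tag bits" trade described on the slide.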

21 Cache Associativity. Reduces miss rate, thus improving performance. Impact on power and energy? (Energy = Power * Time)

22 Associativity is Costly. Associativity improves hit rate, but at the cost of more power per access. Are the power savings from reduced misses outweighed by the increased power per hit? [Figures: energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only); energy per access for 8 Kbyte cache]
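
The question on slide 22 can be explored with a very rough energy model. The sketch below is an assumption-laden illustration, not the model used in the presentation: every number is hypothetical and in arbitrary units, and the extra time and off-chip traffic caused by a miss are folded into a single `energy_per_miss` term.

```python
def total_energy(accesses, miss_rate, energy_per_access, energy_per_miss):
    """Every access pays the cache array cost; each miss pays an extra penalty."""
    return accesses * (energy_per_access + miss_rate * energy_per_miss)


# Hypothetical per-access costs: the 4-way cache reads more ways per access.
DIRECT_MAPPED = {"energy_per_access": 1.0, "energy_per_miss": 50.0}
FOUR_WAY = {"energy_per_access": 2.0, "energy_per_miss": 50.0}


def compare(accesses, dm_miss_rate, fw_miss_rate):
    """Return (direct mapped energy, 4-way energy) for one program."""
    dm = total_energy(accesses, dm_miss_rate, **DIRECT_MAPPED)
    fw = total_energy(accesses, fw_miss_rate, **FOUR_WAY)
    return dm, fw


# A program with few conflicts: direct mapped uses less energy.
friendly = compare(1_000_000, dm_miss_rate=0.01, fw_miss_rate=0.008)

# A program with many conflict misses: 4-way uses less energy.
conflicting = compare(1_000_000, dm_miss_rate=0.10, fw_miss_rate=0.01)

print(friendly, conflicting)
```

With these made-up numbers the lower-energy choice flips between the two programs, which is why neither fixed associativity is best for every workload.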

23 Associativity and Energy. Best performing cache is not always lowest energy. [Figure annotation: significantly poorer energy]

24 Associativity Dilemma. Direct mapped cache: good hit rate on most examples; low power per access; but poor hit rate on some examples, giving high power due to many misses. Four-way set-associative cache: good hit rate on nearly all examples; but high power per access; overkill for most examples, thus wasting energy. Dilemma: design for the average or worst case?

25 Associativity Dilemma. Obviously not a clear choice.

26 Our Solution: Configurable Cache. Can be configured as 4, 2, or 1 way: ways can be concatenated. Size can also be configured by shutting down ways, which saves static power (leakage). [Figure: an address bit that was part of the tag becomes the bit that selects the way; annotation: this bit selects the way]

27 Configurable Cache Design: Way Concatenation (4, 2 or 1 way). Small area and performance overhead. [Figure: configuration circuit combining registers reg0 and reg1 with address bits a11 and a12 to produce bank enables c0 to c3; address fields: tag address a31 to a13, then a12 and a11, index a10 to a5, line offset a4 to a0; 6x64 decoder, tag part, data array, bitline, column mux, sense amps, mux driver, data output; critical path marked]
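
A behavioral sketch of way concatenation, based only on the labels that survive from the slide 27 figure. The bank numbering and the exact roles of a11 and a12 are assumptions; the actual logic of the configuration circuit cannot be recovered from this transcript.

```python
def active_banks(address, ways):
    """Return the banks (0 to 3) that must be read for this address.

    Assumed layout: four banks, each indexed by a10..a5. Bits a11 and a12
    act as tag bits in 4-way mode and as bank-select bits otherwise.
    """
    a11 = (address >> 11) & 1
    a12 = (address >> 12) & 1
    if ways == 4:
        return [0, 1, 2, 3]              # read and compare every bank
    if ways == 2:
        return [2 * a11, 2 * a11 + 1]    # a11 picks one pair of banks
    if ways == 1:
        return [2 * a12 + a11]           # a12 and a11 pick a single bank
    raise ValueError("ways must be 4, 2 or 1")


addr = (1 << 12) | (1 << 11)             # a12 = 1, a11 = 1
print(active_banks(addr, 4))  # [0, 1, 2, 3]
print(active_banks(addr, 2))  # [2, 3]
print(active_banks(addr, 1))  # [3]
```

Under these assumptions the total capacity is unchanged in all three modes; lower associativity simply activates fewer banks per access, which is where the per-access energy saving would come from.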

28 Configurable Cache Experiments. Motorola PowerStone benchmark g3fax: way concatenation outperforms 4-way and direct mapped.

29 Configurable Cache Experiments. Configurable cache with both way concatenation and way shutdown was best on average. And, it was superior on every benchmark. Considered programs from Powerstone, MediaBench, and Spec2000. [Figure: 100% = 4-way conventional cache]

30 Configurable Cache Experiments: Line Size Too. Best line size also differs per example. Our cache can be configured for a line of 16, 32 or 64 bytes. 64 is usually best, but 16 is much better in a couple of cases. A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy. [Figure: 100% = 4-way conventional cache; csb: concatenate plus shutdown cache]
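
To get a feel for the size of the configuration space, the sketch below enumerates the settings implied by slides 26 to 30. It assumes four 2 Kbyte banks (8 Kbytes divided by 4 ways), that way shutdown leaves 4, 2 or 1 banks powered, and that any associativity up to the number of powered banks is allowed; none of these constraints is stated explicitly in the transcript.

```python
BANK_BYTES = 2 * 1024        # assumed: 8 Kbyte cache split into 4 banks
LINE_SIZES = (16, 32, 64)    # line sizes in bytes, from slide 30


def configurations():
    """List every (size, ways, line size) setting under the stated assumptions."""
    configs = []
    for powered_banks in (4, 2, 1):      # way shutdown sets the total size
        for ways in (4, 2, 1):           # way concatenation sets associativity
            if ways > powered_banks:
                continue                 # cannot have more ways than banks
            for line_bytes in LINE_SIZES:
                configs.append({
                    "size_bytes": powered_banks * BANK_BYTES,
                    "ways": ways,
                    "line_bytes": line_bytes,
                })
    return configs


for cfg in configurations():
    print(cfg)
print(len(configurations()))  # 18
```

A space this small can be searched exhaustively per application, for example by simulating each setting and keeping the one with the lowest energy.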

31 Configurable Platform Use. Platforms increasingly come with on-chip FPGA. Can we use that FPGA to improve software performance and energy? [Figure: pre-fabricated platform IC containing uP, L1 cache, DSP, JPEG dcd, peripherals, FPGA; uP and FPGA highlighted]

32 Commercial Single-Chip Microprocessor/FPGA Platforms. Triscend E5: based on 8-bit 8051 CISC core; 10 Dhrystone MIPS at 40 MHz; 60 kbytes on-chip RAM; up to 40K logic gates; cost only about $4 (in volume). [Figure: Triscend E5 chip showing configurable logic, 8051 processor plus other peripherals, memory]

33 Single-Chip Microprocessor/FPGA Platforms. Atmel FPSLIC (Field-Programmable System-Level IC): based on AVR 8-bit RISC core; 20 Dhrystone MIPS; 5k-40k configurable logic gates; on-chip RAM (20-36Kb) and EEPROM; $5-$10. Courtesy of Atmel

34 Single-Chip Microprocessor/FPGA Platforms. Triscend A7 chip: based on ARM7 32-bit RISC processor; 54 Dhrystone MIPS at 60 MHz; up to 40k logic gates; on-chip cache and RAM; $10-$20 in volume. Courtesy of Triscend

35 Single-Chip Microprocessor/FPGA Platforms. Altera's Excalibur EPXA 10: ARM (922T) hard core; ~200 Dhrystone MIPS at ~200 MHz; devices range from ~200k to ~2 million programmable logic gates. Source: www.altera.com

36 Single-Chip Microprocessor/FPGA Platforms. Xilinx Virtex II Pro: PowerPC based; 420 Dhrystone MIPS at 300 MHz; 1 to 4 PowerPCs; 4 to 16 gigabit transceivers; 12 to 216 multipliers; 3,000 to 50,000 logic cells; 200k to 4M bits RAM; 204 to 852 I/O; $100-$500 (>25,000 units). [Figure: config. logic, PowerPCs, up to 16 serial transceivers at 622 Mbps to 3.125 Gbps] Courtesy of Xilinx

37 Single-Chip Microprocessor/FPGA Platforms. Why wouldn't future microprocessor chips include some amount of on-chip FPGA?

38 Single-Chip Microprocessor/FPGA Platforms. Lots of silicon area taken up by configurable logic: as discussed earlier, less of an issue every year. Smaller area doesn't necessarily mean higher yield (lower costs) any more: previously could pack more die onto a wafer, but die are becoming pad (pin) limited in nanoscale technologies. Configurable logic typically used for peripherals, glue logic, etc. We have investigated another use...

39 Software Improvements using On-Chip Configurable Logic. Partitioned software critical loops onto on-chip FPGA for several benchmarks. Most time spent in one or two loops. Extensive simulated results for 8051 and MIPS, for Powerstone (PS), MediaBench (MB) and Netbench (NB).

40 Software Improvements using On-Chip Configurable Logic. Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg).
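
Why moving one or two critical loops to the FPGA can speed up a whole program follows from Amdahl's law. The sketch below is a generic illustration, not the method used to obtain the results on slide 40; the loop fraction and loop speedup in the example are hypothetical.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: whole-program speedup when only the loops get faster.

    loop_fraction: share of original run time spent in the moved loops (0 to 1).
    loop_speedup: how much faster those loops run on the FPGA (> 0).
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if loop_speedup <= 0:
        raise ValueError("loop_speedup must be positive")
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def speedup_bound(loop_fraction):
    """Upper bound reached if the moved loops took no time at all."""
    return float("inf") if loop_fraction == 1.0 else 1.0 / (1.0 - loop_fraction)


# Hypothetical program: 80% of time in critical loops, made 10x faster.
print(round(overall_speedup(0.8, 10.0), 2))  # 3.57
print(round(speedup_bound(0.8), 2))          # 5.0
```

The bound shows that the time left in software limits the gain, so the benefit depends on how concentrated the run time is in the loops that are moved.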

41 Speedup Gained with Relatively Few Gates. Created several partitioned versions of each benchmark. Most speedup gained with first 20,000 gates: surprisingly few gates. References: Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002; Stitt and Vahid, IEEE Design and Test, Dec. 2002; J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

42 Software Improvements using On-Chip Configurable Logic: Verified through Physical Measurement. Performed physical measurements on Triscend A7 and E5 devices. Similar results (even a bit better). [Figure: Triscend A7 development board, A7 IC]

43 Other Types of Configurability. Microprocessor (other researchers): VLIW configurations, voltage scaling. Peripherals: e.g., JPEG decoder with different precisions. Bus topology. Etc. [Figure: IC containing uP, L1 cache, DSP, JPEG dcd, peripherals, FPGA]

44 Conclusions. Trend is away from semi-custom IC fabrication: pressures encourage buying pre-fabricated platforms. Platforms must be highly configurable, to be useful for a variety of applications, and hence mass produced. We have discussed: software speedup/energy benefits of on-chip configurable logic (3x speedups and 34% energy savings with only ~10,000 gates); creating a highly-configurable cache architecture (40% energy savings compared to conventional cache). Designing highly-configurable platforms, and facilitating their use with good exploration tools, can help enable platform-based design. See http://www.cs.ucr.edu/~vahid for more information.

