Frank Vahid Associate Professor


1 Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend.
Frank Vahid, UC Riverside

2 General Purpose vs. Special Purpose
Amazing to think this came from wolves.
Standard tradeoff.
Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts.

3 General Purpose vs. Single Purpose Processors
Designers have long known that general-purpose processors are flexible and single-purpose processors are fast.
Example computation:
  total = 0
  for i = 1 to N loop
    total += M[i]
  end loop
[Figure, general purpose: controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code for "total = 0 for i = 1 to …", data memory]
[Figure, single purpose: controller (control logic, state register), datapath (i, total, +), data memory]
ENIAC, 1940s: its flexibility was the big deal.
Tradeoff axes: flexibility, design cost, time-to-market, performance, power efficiency, size.

4 Mixing General and Single Purpose Processors
A.k.a. hardware/software partitioning.
Hardware: single-purpose processors (coprocessor, accelerator, peripheral, etc.).
Software: general-purpose processors (though hardware underneath!).
Especially important for embedded systems: computers embedded in devices (cameras, cars, toys, even people), where demands on speed, cost, time-to-market, power, size, … are tough.
[Figure: digital camera chip with lens and CCD. Blocks: microcontroller, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control, multiplier/accumulator]

5 How is Partitioning Done for Embedded Systems?
Partitioning into hw and sw blocks is done early, during the conceptual stage.
Sw design is done separately from hw design.
Attempts since the late 1980s to automate partitioning have not yet been successful:
  Partitioning manually is reasonably straightforward.
  The spec is informal and not machine readable.
  Sw algorithms may differ from hw algorithms.
  No compelling need for tools.
[Flow diagram: informal spec, system partitioning, sw spec and hw spec, sw design and hw design, processor and ASIC]

6 New Platforms Invite New Efforts in Hw/Sw Partitioning
New single-chip platforms contain both a general-purpose processor and an FPGA.
FPGA: field-programmable gate array.
  Programmable just like software, and therefore flexible.
  Intended largely to implement single-purpose processors.
Can we perform a later partitioning to improve the software too?
[Flow diagram: informal spec, system partitioning, sw spec and hw spec, sw design and hw design, a second partitioning step after sw design, processor + FPGA and ASIC]

7 Commercial Single-Chip Microprocessor/FPGA Platforms
Triscend E5 (2000): based on an 8-bit 8051 CISC core.
10 Dhrystone MIPS at 40 MHz.
Up to 40K logic gates.
Cost only about $4.
[Figure: Triscend E5 chip, showing configurable logic, 8051 processor plus other peripherals, and memory]

8 Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC (Field-Programmable System-Level IC).
Based on AVR 8-bit RISC core.
20 Dhrystone MIPS.
5k-40k logic gates.
$5-$10.
Courtesy of Atmel.

9 Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip (2001).
Based on ARM7 32-bit RISC processor.
54 Dhrystone MIPS at 60 MHz.
Up to 40k logic gates.
$10-$20 in volume.
Courtesy of Triscend.

10 Single-Chip Microprocessor/FPGA Platforms
Altera’s Excalibur EPXA 10 (2002).
ARM (922T) hard core.
200 Dhrystone MIPS at 200 MHz.
~200k to ~2 million logic gates.

11 Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex II Pro (2002).
PowerPC based: 420 Dhrystone MIPS at 300 MHz.
1 to 4 PowerPCs.
4 to 16 gigabit transceivers.
12 to 216 multipliers.
Millions of logic gates.
200k to 4M bits RAM.
204 to 852 I/O.
$100-$500 (>25,000 units).
[Figure labels: up to 16 serial transceivers, 622 Mbps to Gbps; PowerPCs; config. logic]
Courtesy of Xilinx.

12 Single-Chip Microprocessor/FPGA Platforms
Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
One argument against: area.
  Lots of silicon area taken up by FPGA.
  FPGA about times less area efficient than custom logic.
  FPGA used to be for prototyping, too big for final products.
But chip trends imply that FPGAs will be O.K. in final products…

13 How Much is Enough?
Perhaps a bit small

14 How Much is Enough?
Reasonably sized

15 How Much is Enough?
Probably plenty big for most of us

16 How Much is Enough?
More than typically necessary

17 How Much Custom Logic is Enough?
1993: ~1 million logic transistors. Perhaps a bit small.
For scale: 8-bit processor, 50,000 transistors; Pentium, 3 million transistors; MPEG decoder, several million transistors.
[Figure labels: IC package, IC]

18 How Much Custom Logic is Enough?
1996: ~5-8 million logic transistors. Reasonably sized.

19 How Much Custom Logic is Enough?
1999: ~ million logic transistors. Probably plenty big for most of us.

20 How Much Custom Logic is Enough?
2002: ~ million logic transistors. More than typically necessary.

21 How Much Custom Logic is Enough?
2008: >1 BILLION logic transistors. Perhaps very few people could design this.

22 Very Few Companies Can Design High-End ICs
[Figure: design productivity gap. Logic transistors per chip (in millions) and productivity (K transistors per staff-month) plotted against year, 1981 to 2009. IC capacity follows Moore’s Law; productivity grows more slowly, and the gap widens. Source: ITRS’99]
Designer productivity is growing at a slower rate than IC capacity.
1981: 100 designer months, ~$1M.
2002: 30,000 designer months, ~$300M.

23 Single-Chip Platforms with On-Chip FPGAs
So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyway.
  Designing for that much area is becoming out of reach of mainstream designers.
But couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?

24 Shrinking Chips
Yes, but there’s a limit: chips are becoming pin limited.
A football huddle can only get so small.
[Figure: shrinking a chip leaves the pads that connect to external pins in place; the area they enclose will exist whether we use it all or not]

25 Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific standard product.
  Domain-specific pre-fabricated IC, e.g., a digital camera IC.
ASIC: application specific IC.
ASSP revenue > ASIC revenue.
ASSP design starts > ASIC design starts.
  A design start counts a unique IC design and ignores the quantity of the same IC.
ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices.
Source: Gartner/Dataquest, September ’01.

26 Microprocessor/FPGA Platforms
Trends point towards such platforms increasing in popularity.
Can we automatically partition the software to utilize the FPGA, for improved speed and energy?

27 Automatic Hardware/Software Partitioning
Since the late 1980s, the goal has been spec in, hw/sw out.
But there is no successful commercial tool yet. Why?
[Diagram, the ideal: "spec" goes into a partitioner, which emits software (compilation, processor) and hardware (synthesis, ASIC/FPGA)]
The "spec" in practice:
  // From MediaBench’s JPEG codec
  GLOBAL(void)
  jpeg_fdct_ifast (DCTELEM * data)
  {
    DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    DCTELEM tmp10, tmp11, tmp12, tmp13;
    DCTELEM z1, z2, z3, z4, z5, z11, z13;
    DCTELEM *dataptr;
    int ctr;
    SHIFT_TEMPS
    /* Pass 1: process rows. */
    dataptr = data;
    for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
      tmp0 = dataptr[0] + dataptr[7];
      tmp7 = dataptr[0] - dataptr[7];
      tmp1 = dataptr[1] + dataptr[6];
      // Thousands of lines like this in dozens of files

28 Why No Successful Tool Yet?
Most research has focused on extensive exploration.
  Roots in VLSI CAD.
  Decompose the problem into fine-grained operations.
  Apply sophisticated partitioning algorithms, e.g., min-cut, dynamic programming, simulated annealing, tabu search, genetic evolution, etc.
Is this overkill?
[Diagram: "spec" of 1000s of nodes (like circuit partitioning) fed into the partitioner]

29 We Really Only Need Consider a Few Loops – Due to the 90-10 Rule
The recent appearance of embedded benchmark suites enables analysis, and hence understanding, of the real problem.
  We’ve examined UCLA’s MediaBench, Netbench, and Motorola’s Powerstone.
  Currently examining EEMBC (embedded equivalent of SPEC).
  UCR loop analysis tools are based on SimpleScalar and Simics.
[Code excerpt: jpeg_fdct_ifast from MediaBench’s JPEG codec, as on slide 27]
[Plot: fraction of total execution time (0.1 to 1) versus loop number (1 to 10). Each loop was assigned a number, sorted by fraction of contribution to total execution time]
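The loop numbering described above (sort loops by their contribution to total execution time, then number them) can be sketched in C as follows. This is a minimal illustration, not the UCR tools: the `LoopProfile` record, the `rank_loops` function, and the idea that a profiler hands over per-loop cycle counts are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical profile record: one entry per loop reported by a profiler. */
typedef struct {
    int id;                /* loop number, assigned after sorting */
    unsigned long cycles;  /* cycles spent inside this loop */
} LoopProfile;

/* qsort comparator: larger cycle counts come first. */
static int by_cycles_desc(const void *a, const void *b)
{
    const LoopProfile *x = a;
    const LoopProfile *y = b;
    if (x->cycles < y->cycles) return 1;
    if (x->cycles > y->cycles) return -1;
    return 0;
}

/* Sort loops by contribution, number them 1..n, and store each loop's
   fraction of total_cycles in fraction[] (same order as the sorted loops). */
void rank_loops(LoopProfile *loops, size_t n,
                unsigned long total_cycles, double *fraction)
{
    assert(total_cycles > 0);
    qsort(loops, n, sizeof loops[0], by_cycles_desc);
    for (size_t i = 0; i < n; i++) {
        loops[i].id = (int)i + 1;
        fraction[i] = (double)loops[i].cycles / (double)total_cycles;
    }
}
```

After the call, `fraction[0]` is the share of the most frequent loop, which is the quantity plotted against loop number on this slide.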

30 The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of time, using 1% of code.

31 So Need We Only Consider the First Few Loops? Not Necessarily
What if programs were self-similar with respect to the 90-10 rule?
  If we remove the most frequent loop, does the rule still hold?
  Intuition might say yes: remove the loop, and we have another program.
[Plots: % remaining execution time versus loop number (1 to 10); speedup (up to 1000) versus loop number (1 to 10)]
So we need only speed up the first few loops. After that, speedups are limited.
Good from a tool perspective!
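The two plotted quantities are related by simple arithmetic: if the first k loops are moved to hardware and (ideally) take no time at all, the remaining execution time is one minus the sum of their fractions, and the best possible overall speedup is the reciprocal of that. A small C sketch under that idealizing assumption follows; the function names are hypothetical and the fractions in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* fraction[i] is loop i's share of total execution time, largest first.
   Returns the share of execution time left after removing the first k loops. */
double remaining_time(const double *fraction, size_t n, size_t k)
{
    assert(k <= n);
    double remaining = 1.0;
    for (size_t i = 0; i < k; i++)
        remaining -= fraction[i];
    return remaining;
}

/* Upper bound on overall speedup if the first k loops took zero time. */
double ideal_speedup(const double *fraction, size_t n, size_t k)
{
    double remaining = remaining_time(fraction, n, k);
    assert(remaining > 0.0);
    return 1.0 / remaining;
}
```

With illustrative fractions of 0.5, 0.2 and 0.1, removing one loop bounds the speedup at 2, removing two at about 3.3, and removing three at about 5. Once the remaining loops each contribute only a few percent, further gains depend on how much of the residual time they cover, which is the sense in which speedups become limited.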

32 Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
Used a multimeter and timer to measure performance and power.
Obtained good speedups and energy savings by partitioning software among the microprocessor and the on-chip FPGA.
[Photos: E5 IC; Triscend A7 development board]

33 Simulation-Based Results for More Benchmarks
Simulation is quicker than physical implementation, and the results matched reasonably well.
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg).

34 Looking at Multiple Loops per Benchmark
Manually created several partitioned versions of each benchmark.
Most speedup gained with the first 20,000 gates. Surprisingly few gates!
References:
  Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002.
  Stitt and Vahid, IEEE Design and Test, Dec. 2002.
  J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

35 Ideal Speedups for Different Architectures
Varied the loop speedup ratio (sw time / hw time of the loop itself) to see the impact of a faster microprocessor or a slower FPGA: 30, 20, 10 (base case), 5 and 2.
Loop speedups of 5 or more work fine for the first few loops, and are not hard to achieve.
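The effect of the loop speedup ratio on overall speedup can be estimated with an Amdahl's-law style calculation. The C sketch below is an illustration under stated assumptions, not the model used for the slide's data: `partitioned_speedup` is a hypothetical name, every moved loop is assumed to get the same ratio, and communication overhead between processor and FPGA is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* fraction[i]: share of execution time of loop i (largest first).
   k:           number of loops moved to hardware.
   ratio:       loop speedup ratio, sw time / hw time of the loop itself.
   Returns the overall application speedup. */
double partitioned_speedup(const double *fraction, size_t k, double ratio)
{
    assert(ratio > 0.0);
    double moved = 0.0;
    for (size_t i = 0; i < k; i++)
        moved += fraction[i];
    assert(moved >= 0.0 && moved <= 1.0);
    /* Unmoved code runs as before; moved loops run `ratio` times faster. */
    double new_time = (1.0 - moved) + moved / ratio;
    return 1.0 / new_time;
}
```

For a single loop taking 50% of execution time, a ratio of 10 gives an overall speedup of about 1.82 and a ratio of 30 gives about 1.94, while a ratio of 2 gives only about 1.33. The overall result is therefore fairly insensitive to the ratio once it reaches the mid single digits.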

36 Ideal Energy Savings for Different Architectures
Varied the loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0.
Energy savings are quite resilient to variations.
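Since energy is power multiplied by time, a rough estimate of the savings follows from the two ratios. The C sketch below uses a deliberately simple model that is an assumption of this example rather than the one behind the slide's data: the microprocessor is assumed to draw no power while the FPGA runs the loop, and all quantities are normalized to the software-only run. The name `energy_savings` is hypothetical.

```c
#include <assert.h>

/* moved:         share of software execution time moved to the FPGA (0..1).
   speedup_ratio: sw time / hw time of the moved code.
   power_ratio:   FPGA power / microprocessor power.
   Returns the fraction of energy saved relative to the all-software run,
   assuming the microprocessor is idle at zero power while the FPGA runs. */
double energy_savings(double moved, double speedup_ratio, double power_ratio)
{
    assert(moved >= 0.0 && moved <= 1.0);
    assert(speedup_ratio > 0.0 && power_ratio >= 0.0);
    /* Software part: unchanged power, (1 - moved) of the time.
       Hardware part: power_ratio times the power, moved / speedup_ratio of the time. */
    double energy_after = (1.0 - moved) + power_ratio * moved / speedup_ratio;
    return 1.0 - energy_after;
}
```

With half the execution time moved and a speedup ratio of 10, this model gives savings of 42.5% at a power ratio of 1.5 and 37.5% at a power ratio of 2.5. The hardware runs for so short a time that even a large increase in its power changes the total only slightly, which is consistent with the resilience noted on the slide.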

37 How is Automated Partitioning Done?
The previous data was obtained manually.
[Flow diagram: informal spec, system partitioning, sw spec and hw spec, sw design and hw design, a second partitioning step after sw design, processor + FPGA and ASIC]

38 Source-Level Partitioning
The compiler front-end converts the source code into an intermediate format, such as SUIF (Stanford University Intermediate Format).
The intermediate format is explored for hardware candidates (hw/sw partitioning).
The compiler back-end produces assembly and object files; the binary is generated by assembling and linking.
Hw source is generated and synthesized into a netlist.
[Flow diagram: SW source, compiler front-end, hw/sw partitioning, compiler back-end and hw source, assembler & linker and synthesis, binary and netlists, processor and FPGA]

39 Problems with Source-Level Partitioning
Though technically superior, source-level partitioning:
  Disrupts the standard commercial tool flow significantly.
  Requires a special compiler (ouch!).
  Must cope with multiple source languages, and with changing source languages.
  Raises the question of how to deal with library code, assembly code, and object code.
[Diagram: C source, C++ source and Java source each need their own compiler front-end (C SUIF compiler, C++ SUIF compiler, and "?" for Java)]

40 Binary Partitioning
Source code is first compiled and linked in order to create a binary.
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning.
HDL is generated and synthesized, and the binary is updated to use the hardware.
[Flow diagram: SW source, compilation, assembly & object files, assembler & linker, binary, hw/sw partitioning, hw source and updated binary, synthesis, netlists, processor and FPGA]

41 Binary-Level Partitioning Results (ICCAD’02)
                          Source-level   Binary-level
Average speedup           1.5            1.4
Average energy savings    27%            13%
Average gates             4,361          10,325 (large area overhead)

42 Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
Dynamic software optimization is gaining interest, e.g., HP’s Dynamo.
What better optimization than moving to FPGA?
Add a component on-chip that:
  Detects the most frequent sw loops
  Decompiles a loop
  Performs compiler optimizations
  Synthesizes to a netlist
  Places and routes the netlist onto a (simple) FPGA
  Updates the sw to call the FPGA
Self-improving IC: can be invisible to the designer, and appears as an efficient processor.
HARD! Much future work.
[Block diagram: processor, I$, D$, mem, DMA, config. logic, profiler, proc.]
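The first step in the list, detecting the most frequent software loops, could be approximated by counting taken backward branches per target address. The C sketch below assumes that heuristic; the slides do not say how the on-chip profiler works, so the `LoopProfiler` structure, the 16-entry table, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* hypothetical size of a small on-chip table */

typedef struct {
    uint32_t target;  /* loop start address (backward-branch target) */
    uint32_t count;   /* taken backward branches seen for this target */
    int used;         /* nonzero once the slot holds a loop */
} LoopCounter;

typedef struct {
    LoopCounter slot[PROFILE_SLOTS];
} LoopProfiler;

void profiler_init(LoopProfiler *p)
{
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        p->slot[i].target = 0;
        p->slot[i].count = 0;
        p->slot[i].used = 0;
    }
}

/* Record one taken branch from pc to target. Only backward branches
   (target <= pc) are treated as loop iterations. */
void profiler_on_branch(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    if (target > pc)
        return;  /* forward branch: not a loop back-edge */
    LoopCounter *free_slot = NULL;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        LoopCounter *c = &p->slot[i];
        if (c->used && c->target == target) {
            c->count++;
            return;
        }
        if (!c->used && free_slot == NULL)
            free_slot = c;
    }
    if (free_slot != NULL) {  /* new loop; dropped silently if the table is full */
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
}

/* Stores the start address of the most frequent loop in *target and
   returns 1, or returns 0 if no loop has been seen yet. */
int profiler_hottest(const LoopProfiler *p, uint32_t *target)
{
    assert(target != NULL);
    const LoopCounter *best = NULL;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        const LoopCounter *c = &p->slot[i];
        if (c->used && (best == NULL || c->count > best->count))
            best = c;
    }
    if (best == NULL)
        return 0;
    *target = best->target;
    return 1;
}
```

The address returned by `profiler_hottest` would identify the region to hand to the later steps (decompilation, synthesis, place and route), none of which is sketched here.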

43 Conclusions
Hardware/software partitioning can significantly improve software speed and energy.
Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive.
A successful commercial tool is still on the horizon.
  Binary-level partitioning may help in some cases.
  Source-level partitioning can yield massive parallelism (Profs. Najjar/Payne).
Future dynamic hw/sw partitioning possible?
The distinction between sw and hw is continually being blurred!
Many people involved: Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
Support from NSF, Triscend, and soon SRC…
Exciting new directions!

