Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid
Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend.
Frank Vahid, UC Riverside

General Purpose vs. Special Purpose
Amazing to think this came from wolves
Standard tradeoff
Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts.

General Purpose vs. Single Purpose Processors
Example computation:
  total = 0
  for i = 1 to N loop
    total += M[i]
  end loop
Designers have long known that:
  General-purpose processors are flexible
  Single-purpose processors are fast
[Figure: general-purpose processor (controller with control logic, state register, IR, and PC; datapath with register file and general ALU; program memory holding the assembly code for the loop; data memory) OR single-purpose processor (controller with control logic and state register; datapath with i, total, and an adder; data memory)]
ENIAC, 1940s: its flexibility was the big deal
Tradeoffs: flexibility, design cost, time-to-market, performance, power efficiency, size

Mixing General and Single Purpose Processors
A.k.a. hardware/software partitioning
Hardware: single-purpose processors
  Coprocessor, accelerator, peripheral, etc.
Software: general-purpose processors
  Though hardware underneath!
Especially important for embedded systems
  Computers embedded in devices (cameras, cars, toys, even people)
  Speed, cost, time-to-market, power, size, … demands are tough
[Figure: digital camera chip with lens and CCD; blocks: microcontroller, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control, multiplier/accumulator]

How is Partitioning Done for Embedded Systems?
Partitioning into hw and sw blocks is done early
  During the conceptual stage
  Sw design is done separately from hw design
Attempts since the late 1980s to automate partitioning have not yet been successful
  Partitioning manually is reasonably straightforward
  Spec is informal and not machine readable
  Sw algorithms may differ from hw algorithms
  No compelling need for tools
[Figure: design flow: informal spec → system partitioning → sw spec and hw spec → sw design and hw design → processor and ASIC]

New Platforms Invite New Efforts in Hw/Sw Partitioning
New single-chip platforms contain both a general-purpose processor and an FPGA
  FPGA: field-programmable gate array
  Programmable just like software → flexible
  Intended largely to implement single-purpose processors
Can we perform a later partitioning to improve the software too?
[Figure: design flow: informal spec → system partitioning → sw spec and hw spec → sw design and hw design; a second partitioning step maps the sw design onto processor + FPGA; hw design targets the ASIC]

Commercial Single-Chip Microprocessor/FPGA Platforms
[Figure: Triscend E5 chip: configurable logic, 8051 processor plus other peripherals, memory]
Triscend E5: based on 8-bit 8051 CISC core (2000)
  10 Dhrystone MIPS at 40 MHz
  Up to 40K logic gates
  Cost only about $4

Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC
  Field-Programmable System-Level IC
  Based on AVR 8-bit RISC core
  20 Dhrystone MIPS
  5k-40k logic gates
  $5-$10
Courtesy of Atmel

Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip (2001)
  Based on ARM7 32-bit RISC processor
  54 Dhrystone MIPS at 60 MHz
  Up to 40k logic gates
  $10-$20 in volume
Courtesy of Triscend

Altera’s Excalibur EPXA 10 (2002)
  ARM (922T) hard core
  200 Dhrystone MIPS at 200 MHz
  ~200k to ~2 million logic gates
Source:

Xilinx Virtex II Pro (2002)
  PowerPC based
  420 Dhrystone MIPS at 300 MHz
  1 to 4 PowerPCs
  4 to 16 gigabit transceivers
  12 to 216 multipliers
  Millions of logic gates
  200k to 4M bits RAM
  204 to 852 I/O
  $100-$500 (>25,000 units)
[Figure: Virtex II Pro chip showing PowerPCs, configurable logic, and up to 16 serial transceivers (622 Mbps to Gbps); courtesy of Xilinx]

Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
One argument against: area
  Lots of silicon area taken up by FPGA
  FPGA about times less area efficient than custom logic
  FPGA used to be for prototyping, too big for final products
But chip trends imply that FPGAs will be O.K. in final products…

How Much is Enough?
Perhaps a bit small

How Much is Enough?
Reasonably sized

How Much is Enough?
Probably plenty big for most of us

How Much is Enough?
More than typically necessary

How Much Custom Logic is Enough?
[Figure: IC package containing the IC]
1993: ~1 million logic transistors
Perhaps a bit small
  8-bit processor: 50,000 transistors
  Pentium: 3 million transistors
  MPEG decoder: several million transistors

1996: ~5-8 million logic transistors
Reasonably sized

1999: ~ million logic transistors
Probably plenty big for most of us

2002: ~ million logic transistors
More than typically necessary

2008: >1 BILLION logic transistors
Perhaps very few people could design this

Very Few Companies Can Design High-End ICs
Design productivity gap
[Figure: IC capacity (logic transistors per chip, in millions) and designer productivity (K transistors per staff-month) plotted on log scales against year, 1981 to 2009; IC capacity follows Moore’s Law and the widening space between the two curves is labeled "Gap". Source: ITRS’99]
Designer productivity is growing at a slower rate than IC capacity
  1981: 100 designer months → ~$1M
  2002: 30,000 designer months → ~$300M

Single-Chip Platforms with On-Chip FPGAs
So, big on-chip FPGAs are O.K., because mainstream designers couldn’t have used all that silicon area anyway
  High-end chips are becoming out of reach of mainstream designers
But couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?

Shrinking Chips
Yes, but there’s a limit
  Chips are becoming pin limited
  A football huddle can only get so small
[Figure: chip shrinking toward its ring of pads connecting to external pins; the area inside the pad ring will exist whether we use it all or not]

Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application-specific standard product
  Domain-specific pre-fabricated IC
  e.g., digital camera IC
ASIC: application-specific IC
ASSP revenue > ASIC revenue
ASSP design starts > ASIC design starts
  A design start is a unique IC design
  Ignores quantity of the same IC
ASIC design starts are decreasing
  Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September ’01

Microprocessor/FPGA Platforms
Trends point towards such platforms increasing in popularity
Can we automatically partition the software to utilize the FPGA?
  For improved speed and energy

Automatic Hardware/Software Partitioning
Since the late 1980s, the goal has been spec in, hw/sw out
But no successful commercial tool yet. Why?

  // From MediaBench’s JPEG codec
  GLOBAL(void)
  jpeg_fdct_ifast (DCTELEM * data)
  {
    DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    DCTELEM tmp10, tmp11, tmp12, tmp13;
    DCTELEM z1, z2, z3, z4, z5, z11, z13;
    DCTELEM *dataptr;
    int ctr;
    SHIFT_TEMPS

    /* Pass 1: process rows. */
    dataptr = data;
    for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
      tmp0 = dataptr[0] + dataptr[7];
      tmp7 = dataptr[0] - dataptr[7];
      tmp1 = dataptr[1] + dataptr[6];
      …
  // Thousands of lines like this in dozens of files

[Figure: ideal flow: “spec” → partitioner → software → compilation → processor, and partitioner → hardware → synthesis → ASIC/FPGA]

Why No Successful Tool Yet?
Most research has focused on extensive exploration
  Roots in VLSI CAD
  Decompose problem into fine-grained operations
  Apply sophisticated partitioning algorithms
  Examples: min-cut, dynamic programming, simulated annealing, tabu search, genetic evolution, etc.
Is this overkill?
[Figure: “spec” decomposed into 1000s of nodes (like circuit partitioning) feeding the partitioner]

We Really Only Need Consider a Few Loops – Due to the 90-10 Rule
Recent appearance of embedded benchmark suites
  Enables analysis → understanding of the real problem
  We’ve examined UCLA’s MediaBench, NetBench, and Motorola’s PowerStone
  Currently examining EEMBC (embedded equivalent of SPEC)
UCR loop analysis tools based on SimpleScalar and Simics
[Figure: the jpeg_fdct_ifast excerpt from MediaBench’s JPEG codec, next to a chart of loop number (1 to 10) versus fraction of total execution time (0.1 to 1)]
Assigned each loop a number, sorted by fraction of contribution to total execution time

The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of the execution time while using 1% of the code
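
The loop ranking described above can be sketched in a few lines of C. This is only an illustration of the idea, not the UCR loop analysis tools themselves: the function name `rank_loops`, the per-loop cycle counts, and the calling convention are all assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator for qsort: orders cycle counts from largest to smallest. */
static int cmp_desc(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/*
 * Hypothetical helper (not from the slides): sorts per-loop cycle counts
 * in place, most frequent loop first, and stores in cumulative[k-1] the
 * fraction of total_cycles covered by the k most frequent loops.
 * total_cycles is the cycle count of the whole program, so the last
 * cumulative entry may be below 1.0 when some time is spent outside loops.
 */
void rank_loops(unsigned long *loop_cycles, size_t n,
                unsigned long total_cycles, double *cumulative) {
    assert(total_cycles > 0);
    qsort(loop_cycles, n, sizeof loop_cycles[0], cmp_desc);
    unsigned long running = 0;
    for (size_t i = 0; i < n; i++) {
        running += loop_cycles[i];
        cumulative[i] = (double)running / (double)total_cycles;
    }
}
```

For example, with made-up cycle counts of 100, 500, 50, and 200 out of 1000 total cycles, the first cumulative entry is 0.5, which corresponds to a single loop accounting for half of the execution time.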

So Need We Only Consider the First Few Loops? Not Necessarily
What if programs were self-similar w.r.t. the 90-10 rule?
  Remove the most frequent loop: does the rule still hold?
  Intuition might say yes: remove the loop, and we have another program
[Figure: two charts of loop number (1 to 10) versus % remaining execution time (0.1 to 1)]
So we need only speed up the first few loops
  After that, speedups are limited
  Good from a tool perspective!
[Figure: two charts of loop number (1 to 10) versus speedup (axis ticks at 500 and 1000)]

Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
Used multimeter and timer to measure performance and power
Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA
[Figure: E5 IC; Triscend A7 development board]

Simulation-Based Results for More Benchmarks
(Quicker than physical implementation; results matched reasonably well)
Average speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates on average

Looking at Multiple Loops per Benchmark
Manually created several partitioned versions of each benchmark
Most speedup gained with first 20,000 gates
  Surprisingly few gates!
References:
  Stitt, Grattan, and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
  Stitt and Vahid, IEEE Design and Test, Dec. 2002
  J. Villarreal, D. Suresh, G. Stitt, F. Vahid, and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)

Ideal Speedups for Different Architectures
Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5, and 2
Loop speedups of 5 or more work fine for the first few loops and are not hard to achieve
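
A rough way to see why only the first few loops matter is to apply Amdahl's law to the loop speedup ratio defined above. The slides do not state which formula was used, so the model below is an assumption, and the name `ideal_speedup` and the `fractions` array are hypothetical. It also assumes every moved loop gets the same speedup ratio and ignores communication overhead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the slides): overall program speedup when
 * the k most frequent loops are moved to hardware, using Amdahl's law.
 * fractions[i] is the share of total sw execution time spent in loop i;
 * loop_speedup is sw time / hw time of the loop itself.
 */
double ideal_speedup(const double *fractions, size_t k, double loop_speedup) {
    assert(loop_speedup > 0.0);
    double moved = 0.0;
    for (size_t i = 0; i < k; i++) {
        moved += fractions[i];
    }
    assert(moved <= 1.0 + 1e-9);  /* loop shares cannot exceed the whole program */
    /* Unmoved code runs at its original speed; moved loops shrink by loop_speedup. */
    return 1.0 / ((1.0 - moved) + moved / loop_speedup);
}
```

With a first loop that takes 50% of the time and the base-case ratio of 10, this model gives an overall speedup of about 1.82. Even an unbounded loop speedup could not exceed 2 for that single loop, which is why the remaining loops contribute less and less.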

Ideal Energy Savings for Different Architectures
Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), and 1.0
Energy savings are quite resilient to variations
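
The interaction between the loop power ratio and the loop speedup ratio can be illustrated with a simple energy = power × time model. This is a sketch under stated assumptions, not the model behind the reported results: the function name `energy_savings` and the `idle_ratio` parameter (microprocessor power while waiting on the FPGA, as a fraction of its active power) are hypothetical, and the FPGA is assumed to draw no power while the microprocessor runs.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the slides): fraction of energy saved by
 * moving one loop to the FPGA, relative to running everything in software.
 * Times and powers are normalized to the sw-only run (time 1, power 1).
 *   loop_fraction: share of sw execution time spent in the loop
 *   loop_speedup:  sw time / hw time of the loop itself
 *   power_ratio:   FPGA power / microprocessor power
 *   idle_ratio:    assumed microprocessor power while the FPGA runs,
 *                  as a fraction of its active power
 */
double energy_savings(double loop_fraction, double loop_speedup,
                      double power_ratio, double idle_ratio) {
    assert(loop_fraction >= 0.0 && loop_fraction <= 1.0);
    assert(loop_speedup > 0.0);
    double sw_energy = 1.0 - loop_fraction;          /* code left on the microprocessor */
    double hw_time = loop_fraction / loop_speedup;   /* loop time once in hardware */
    double hw_energy = hw_time * (power_ratio + idle_ratio);
    return 1.0 - (sw_energy + hw_energy);
}
```

Under these assumptions, a loop taking 50% of the time with a speedup ratio of 10, a power ratio of 1.5, and an idle ratio of 0 saves 42.5% of the energy, and raising the power ratio to 2.5 still saves 37.5%. The short hardware execution time dominates, which is consistent with savings being resilient to the power ratio.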

How is Automated Partitioning Done?
Previous data obtained manually
[Figure: design flow: informal spec → system partitioning → sw spec and hw spec → sw design and hw design; a second partitioning step maps the sw design onto processor + FPGA; hw design targets the ASIC]

Source-Level Partitioning
Front-end converts code into an intermediate format, such as SUIF (Stanford University Intermediate Format)
Intermediate format is explored for hardware candidates
Binary is generated by assembling and linking
Hw source is generated and synthesized into a netlist
[Figure: flow: sw source → compiler front-end → hw/sw partitioning → compiler back-end → assembly & object files → assembler & linker → binary → processor; hw/sw partitioning also produces hw source → synthesis → netlists → FPGA]

Problems with Source-Level Partitioning
Though technically superior, source-level partitioning
  Disrupts standard commercial tool flow significantly
  Requires special compiler (ouch!)
  Must cope with multiple source languages and changing source languages
  How to deal with library code, assembly code, object code?
[Figure: C, C++, and Java sources each need their own compiler front-end: C SUIF compiler, C++ SUIF compiler, and "?" for Java]

Binary Partitioning
Source code is first compiled and linked in order to create a binary
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
HDL is generated and synthesized, and the binary is updated to use hardware
[Figure: flow: sw source → compilation → assembly & object files → assembler & linker → binary → hw/sw partitioning → updated binary → processor; hw/sw partitioning also produces hw source → synthesis → netlists → FPGA]

Binary-Level Partitioning Results (ICCAD’02)
Source-level
  Average speedup: 1.5
  Average energy savings: 27%
  Average 4,361 gates
Binary-level
  Average speedup: 1.4
  Average energy savings: 13%
  Large area overhead, averaging 10,325 gates

Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
Dynamic software optimization gaining interest
  e.g., HP’s Dynamo
What better optimization than moving to FPGA?
Add component on-chip that:
  Detects most frequent sw loops
  Decompiles a loop
  Performs compiler optimizations
  Synthesizes to a netlist
  Places and routes the netlist onto (simple) FPGA
  Updates sw to call FPGA
Self-improving IC
  Can be invisible to designer
  Appears as efficient processor
HARD! Much future work.
[Figure: chip with processor, I$, D$, Mem, DMA, Config. Logic, Profiler, and Proc.]

Conclusions
Hardware/software partitioning can significantly improve software speed and energy
Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
Successful commercial tool still on the horizon
  Binary-level partitioning may help in some cases
  Source-level can yield massive parallelism (Profs. Najjar/Payne)
Future dynamic hw/sw partitioning possible?
Distinction between sw/hw continually being blurred!
Many people involved:
  Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
Support from NSF, Triscend, and soon SRC…
Exciting new directions!
