
Slide 1: Warp Processors
Frank Vahid (Task Leader), Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 - June 2008
Ph.D. students: Greg Stitt (Ph.D. expected June 2006), Ann Gordon-Ross (Ph.D. expected June 2006), David Sheldon (Ph.D. expected 2009), Ryan Mannion (Ph.D. expected 2009), Scott Sirowy (Ph.D. expected 2010)
Industrial liaisons: Brian W. Einloth (Motorola); Serge Rutman, Dave Clark (Intel); Jeff Welser (IBM)

Slide 2: Task Description
Warp processing background: two seed SRC CSR grants (2002-2005) showed feasibility. Idea: transparently move critical binary regions from the microprocessor to an FPGA → 10x performance/energy gains or more.
Task: mature warp technology.
- Years 1/2 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
- Years 2/3: reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

Slide 3: Warp Processing Background: Basic Idea
Architecture: microprocessor (µP) with instruction memory (I-Mem) and data cache (D$), FPGA, on-chip profiler, and on-chip CAD.
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop: Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Slide 4: Warp Processing Background: Basic Idea
Step 2: The microprocessor executes the instructions in the software binary (same architecture and binary as Slide 3).

Slide 5: Warp Processing Background: Basic Idea
Step 3: The profiler monitors instructions and detects critical regions in the binary; here, a critical loop is detected at the add/beq back-branch.

Slide 6: Warp Processing Background: Basic Idea
Step 4: The on-chip CAD reads in the critical region.

Slide 7: Warp Processing Background: Basic Idea
Step 5: The on-chip CAD (dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
    loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Slide 8: Warp Processing Background: Basic Idea
Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit (slide shows an adder-tree circuit).

Slide 9: Warp Processing Background: Basic Idea
Step 7: The on-chip CAD maps the circuit onto the FPGA (slide shows CLBs and switch matrices).

Slide 10: Warp Processing Background: Basic Idea
Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more.
"Warped" binary:
    Mov reg3, 0
    Mov reg4, 0
    loop: // instructions that interact with FPGA
    Ret reg4

Slide 11: Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard-core processors (Xilinx Virtex-II Pro, source: Xilinx; Altera Excalibur, source: Altera)
- FPGAs with soft-core processors (Xilinx Spartan, source: Xilinx)
- Computer boards with FPGAs (Cray XD1, source: FPGA Journal, Apr. 2005)

Slide 12: Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
Programming is a key challenge.
- Solution 1: compile a high-level language to custom binaries.
- Solution 2: use standard binaries and dynamically re-map (warp) them.
  - Cons: less high-level information, less optimization.
  - Pros: available to all software developers, not just specialists; enables data-dependent optimization.
Most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications; this is the most significant concept presently absent in FPGAs and other new programmable platforms.

Slide 13: Warp Processing Background: Basic Technology
Warp processing requires an on-chip profiler, a warp-tuned FPGA, and on-chip CAD, including just-in-time (JIT) FPGA compilation.
Tool flow: Binary → Decompilation → Partitioning → Behavioral/RT Synthesis → JIT FPGA compilation (Logic Synthesis, Technology Mapping, Placement & Routing) → Binary Updater → Updated Binary.

Slide 14: Warp Processing Background: Initial Results
Tool flow measured: Decompilation → Partitioning → RT Synthesis → Logic Synthesis → Technology Mapping → Place → Route.
- Xilinx ISE (manually performed): 60 MB, 9.1 s.
- ROCCAD (on-chip CAD): 3.6 MB, 0.2 s (a 46x improvement); even on a 75 MHz ARM7, only 1.4 s.
- About a 30% performance penalty.

Slide 15: Warp Processing Background: Publications 2002-2005
On-chip profiler:
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; extended version in the special issue "Best of CASES/MICRO" of IEEE Trans. on Computers, Oct. 2005.
Warp-tuned FPGA:
- A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe Conf. (DATE), Feb. 2004.
On-chip CAD, including just-in-time FPGA compilation:
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. Design Automation Conf. (DAC), June 2004.
- A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. CODES/ISSS Conf., Oct. 2003.
- Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. Design Automation Conf. (DAC), 2003.
- On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
- The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
- Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2002.
Related:
- A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid, and R. Lysecky. ACM Transactions on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.

Slide 16: Task Description
Warp processing background: two seed SRC CSR grants (2002-2005) showed feasibility. Idea: transparently move critical binary regions from the microprocessor to an FPGA → 10x performance/energy gains or more.
Task: mature warp technology.
- Year 1 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
- Years 2/3: reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

Slide 17: Automatic High-Level Construct Recovery from Binaries
Challenge: a binary lacks high-level constructs (loops, arrays, ...). Decompilation can help recover them; there is extensive previous work (e.g., [Cifuentes 93, 94, 99]).
Original C code:
    long f(short a[10]) {
      long accum = 0;
      for (int i = 0; i < 10; i++) { accum += a[i]; }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop: Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop: reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Data flow analysis:
    reg3 := 0
    reg4 := 0
    loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Function recovery:
    long f(long reg2) {
      int reg3 = 0;
      int reg4 = 0;
      loop: reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
Control structure recovery:
    long f(long reg2) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
Array recovery:
    long f(short array[10]) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
The recovered code and the original C are almost identical representations.
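The decompilation pipeline above ends with recovered code that is nearly identical to the original source. As a quick sanity check, the two endpoints can be compiled side by side; this is a sketch in which the slide's mem[] indirection is modeled as a plain array parameter, and the accumulator (left uninitialized in the slide's C snippet) is explicitly zeroed:

```c
#include <assert.h>

/* Original function from the slide (accumulator explicitly zeroed). */
static long f_original(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) accum += a[i];
    return accum;
}

/* Array-recovery output from the slide, register-derived names kept. */
static long f_recovered(const short array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++)
        reg4 += array[reg3];
    return reg4;
}
```

Both functions compute the same sum over the 10-element array, which is the sense in which the slide calls the representations "almost identical."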

Slide 18: New Method: Loop Rerolling
Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems: huge input (slow synthesis), inability to unroll to the desired amount, and inability to use advanced loop methods (loop pipelining, fusion, splitting, ...).
Solution: a new decompilation method, loop rerolling: identify unrolled iterations and compact them into one iteration.
Example: loop unrolling turns
    for (int i = 0; i < 3; i++) accum += a[i];
into
    Ld reg2, 100(0)
    Add reg1, reg1, reg2
    Ld reg2, 100(1)
    Add reg1, reg1, reg2
    Ld reg2, 100(2)
    Add reg1, reg1, reg2
and loop rerolling recovers
    for (int i = 0; i < 3; i++) reg1 += array[i];

Slide 19: Loop Rerolling: Identify Unrolled Iterations
Original C code:
    x = x + 1;
    for (i = 0; i < 2; i++) a[i] = b[i] + 1;
    y = x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0] + 1;
    a[1] = b[1] + 1;
    y = x;
Binary:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
Map each instruction to a symbol (Add r3, r3, 1 → B; Ld → A; Add r1, r0, 1 → B; St → C; Mov → D), giving the string representation BABCABCD.
Then find consecutively repeating instruction sequences (adjacent nodes with the same substring) using a suffix tree, a technique derived from bioinformatics. Here, two unrolled iterations are found; each iteration = ABC (Ld, Add, St).
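The search for adjacent repeats in the mapped string (BABCABCD above) can be illustrated far more simply than with the suffix tree the slide uses; the following quadratic scan is only an illustrative sketch of the idea, not the slide's algorithm:

```c
#include <assert.h>
#include <string.h>

/* Find the longest tandem repeat s[i..i+l) == s[i+l..i+2l) in an
   instruction-class string. Such a repeat corresponds to two
   consecutive unrolled loop iterations. */
static int find_tandem_repeat(const char *s, int *start, int *len) {
    int n = (int)strlen(s);
    for (int l = n / 2; l >= 1; l--)        /* prefer the longest repeat */
        for (int i = 0; i + 2 * l <= n; i++)
            if (memcmp(s + i, s + i + l, (size_t)l) == 0) {
                *start = i;
                *len = l;
                return 1;                   /* found */
            }
    *start = -1;
    *len = 0;
    return 0;
}
```

On "BABCABCD" the scan finds the length-3 repeat "ABC" starting at position 1, i.e., the two unrolled Ld/Add/St iterations from the slide.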

Slide 20: Loop Rerolling: Compacting Iterations
Starting from the unrolled-loop identification in the binary:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
1) Determine the relationship of the constants across iterations.
2) Replace the constants with an induction-variable expression:
    Add r3, r3, 1
    i = 0
    loop: Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
3) Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i = 0; i < 2; i++) array1[i] = array2[i] + 1;
    reg4 = reg3;
which matches the original C code:
    x = x + 1;
    for (i = 0; i < 2; i++) a[i] = b[i] + 1;
    y = x;

Slide 21: Method: Strength Promotion
Problem: the compiler's strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance.
Example: an FIR filter tap A[i] = 10*B[i] + 18*B[i+1] + 34*B[i+2] + 66*B[i+3] is strength-reduced so that each multiply becomes a pair of shifts and an add: (B[i] << 3) + (B[i] << 1), (B[i+1] << 4) + (B[i+1] << 1), (B[i+2] << 5) + (B[i+2] << 1), and (B[i+3] << 6) + (B[i+3] << 1).

Slide 22: Strength Promotion
Solution: promote strength-reduced code back to multiplications.
1) Identify strength-reduced subgraphs (shift/add patterns such as (B[i] << 3) + (B[i] << 1)).
2) Replace each subgraph with the equivalent multiplication (10*B[i], 18*B[i+1], 34*B[i+2], 66*B[i+3]), one at a time.
Strength promotion lets synthesis decide on strength reduction based on available resources; synthesis can, of course, apply strength reduction itself.
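The promotion step is only safe because the shift/add pattern and the multiply are exactly equivalent. The slide's coefficients all have the form c = 2^k + 2, so each reduced tap is (b << k) + (b << 1); a minimal check of the equivalence:

```c
#include <assert.h>

/* Strength-reduced form of a FIR tap with coefficient 2^k + 2,
   as produced by the compiler on the slide. */
static long reduced_tap(long b, int k) {
    return (b << k) + (b << 1);    /* e.g. k = 3: 8b + 2b = 10b */
}

/* Promoted form: a plain multiply, a candidate for a hard-core
   FPGA multiplier. */
static long promoted_tap(long b, long coeff) {
    return b * coeff;
}
```

Checking all four slide coefficients (10, 18, 34, 66 against shifts 3, 4, 5, 6) confirms that promotion preserves semantics, so the choice between the two forms is purely a resource/performance decision for synthesis.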

Slide 23: New Decompilation Methods' Benefits
- Rerolling: speedups from better use of smart buffers; other potential benefits include faster synthesis and less area.
- Strength promotion: speedups from fewer cycles and from a faster clock.
- New methods to be developed, e.g., converting pointer data structures to arrays.
(Charts: speedups from loop rerolling, with y-axis = speedup and x-axis = x_y_z, where x = adder constraint, y = multiplier constraint, z = adders needed for reduction; and y-axis = clock frequency vs. x-axis = adders needed for reduction.)

Slide 24: Decompilation is Effective Even with High Compiler-Optimization Levels
Average speedup over 10 examples:
- Speedups were similar on MIPS for -O1 and -O3 optimizations.
- Speedups were similar on ARM for -O1 and -O3 optimizations.
- Speedups were similar between ARM and MIPS; the complex instructions of the ARM didn't hurt synthesis.
- MicroBlaze speedups were much larger: MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware.
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.

Slide 25: Task Description
Warp processing background: two seed SRC CSR grants (2002-2005) showed feasibility. Idea: transparently move critical binary regions from the microprocessor to an FPGA → 10x performance/energy gains or more.
Task: mature warp technology.
- Year 1 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
- Years 2/3: reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

Slide 26: Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages
Performed an in-depth, several-month study with Freescale on an H.264 video decoder: highly-optimized proprietary code, not reference code (a huge difference), and a benefit of SRC collaboration.
Research question: is synthesis from binaries competitive on highly-optimized code?
(Compared to MPEG-2, H.264 offers better quality, or smaller files, using more computation.)

Slide 27: Optimized H.264
- Larger than most benchmarks: H.264 is 16,000 lines; previous work used 100 to several thousand lines.
- Highly optimized: many man-hours of manual optimization, 10x faster than the reference code used in previous works.
- Different profiling results: previous examples spent ~90% of time in several loops; H.264 spends ~90% of time in ~45 functions, which is harder to speed up.

Slide 28: C vs. Binary Synthesis on Optimized H.264
Binary partitioning is competitive with source-level partitioning: speedups over ARM9 software were 2.48 for binary vs. 2.53 for C. Decompilation recovered nearly all high-level information needed for partitioning and synthesis.
This study also uncovered another research problem: why aren't the speedups (from binary or C) closer to "ideal" (zero time per function)?

Slide 29: Coding Guidelines
Are there C-coding guidelines that improve partitioning speedups? This question is orthogonal to the C vs. binary question; guidelines may help both.
We examined the H.264 code further, through several phone conferences, email exchanges, and reports with Freescale liaisons.
Binary and C partitioning are competitive with each other, but both could be better; coding guidelines get closer to the ideal.

Slide 30: Synthesis-Oriented Coding Guidelines
- Pass by value-return: declare a local array and copy in all data needed by a function (makes the lack of aliases explicit).
- Function specialization: create a function version having frequent parameter values as constants.
Original:
    void f(int width, int height) {
      ...
      for (i = 0; i < width; i++)
        for (j = 0; j < height; j++)
          ...
    }
Rewritten:
    void f_4_4() {
      ...
      for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
          ...
    }
The bounds are now explicit, so the loops are unrollable.
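The function-specialization guideline can be made concrete with a complete pair of functions; this is an illustrative sketch (the loop body, a matrix sum, is invented here, since the slide elides it with "..."):

```c
#include <assert.h>

/* General version: loop bounds are runtime parameters, so synthesis
   cannot fully unroll the loops. */
static int f_general(const int *m, int width, int height) {
    int sum = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            sum += m[i * height + j];
    return sum;
}

/* Specialized clone for the frequent case width = height = 4:
   bounds are compile-time constants, so both loops are fully
   unrollable by synthesis. */
static int f_4_4(const int *m) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            sum += m[i * 4 + j];
    return sum;
}
```

The caller would dispatch to f_4_4 when profiling shows (4, 4) dominates, falling back to f_general otherwise, so software behavior is unchanged.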

Slide 31: Synthesis-Oriented Coding Guidelines
- Algorithmic specialization: use parallelizable hardware algorithms when possible.
- Hoisting and sinking of error checking: keep error checking out of loops to enable unrolling.
- Lookup table avoidance: use expressions rather than lookup tables.
Original:
    int clip[512] = { ... };
    void f() {
      ...
      for (i = 0; i < 10; i++)
        val[i] = clip[val[i]];
      ...
    }
Rewritten:
    void f() {
      ...
      for (i = 0; i < 10; i++)
        if (val[i] > 255) val[i] = 255;
        else if (val[i] < 0) val[i] = 0;
      ...
    }
The comparisons can now be parallelized in hardware.
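The lookup-table-avoidance guideline relies on the table and the expression computing the same clip function. A small self-contained check of that equivalence follows; the slide abbreviates the clip[512] contents, so the table layout here (a negative-index offset, common in real clip tables) is an assumption for illustration:

```c
#include <assert.h>

/* Assumed layout: index v represents the value v - OFFSET, so the
   table can clip slightly negative inputs as well. */
enum { OFFSET = 128, TABLE_SIZE = 512 };
static int clip_table[TABLE_SIZE];

static void init_clip_table(void) {
    for (int v = 0; v < TABLE_SIZE; v++) {
        int x = v - OFFSET;                          /* represented value */
        clip_table[v] = x < 0 ? 0 : (x > 255 ? 255 : x);
    }
}

/* Table form: one memory access per element, hard to parallelize. */
static int clip_lookup(int x) { return clip_table[x + OFFSET]; }

/* Expression form: two comparisons, parallelizable in hardware. */
static int clip_expr(int x) {
    if (x > 255) return 255;
    if (x < 0)   return 0;
    return x;
}
```

In software the table may well be faster; the guideline's point is that the expression form exposes comparisons that synthesis can replicate and run in parallel across array elements.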

Slide 32: Synthesis-Oriented Coding Guidelines
- Use explicit control flow: replace function pointers with if statements and static function calls.
Original:
    void (*funcArray[])(char *data) = { func1, func2, ... };
    void f(char *data) {
      ...
      funcPointer = funcArray[i];
      (*funcPointer)(data);
      ...
    }
Rewritten:
    void f(char *data) {
      ...
      if (i == 0) func1(data);
      else if (i == 1) func2(data);
      ...
    }
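A runnable version of the explicit-control-flow rewrite, with func1 and func2 as illustrative stand-ins for the slide's elided functions:

```c
#include <assert.h>

/* Stand-in callees (bodies invented for illustration). */
static void func1(char *d) { d[0] = '1'; }
static void func2(char *d) { d[0] = '2'; }

/* Original style: an indirect call through a function-pointer
   table, which synthesis cannot resolve statically. */
static void (*funcArray[])(char *) = { func1, func2 };
static void dispatch_indirect(int i, char *data) {
    void (*funcPointer)(char *) = funcArray[i];
    (*funcPointer)(data);
}

/* Rewritten style: explicit control flow with static calls, so
   each call target is known at synthesis time. */
static void dispatch_explicit(int i, char *data) {
    if (i == 0)      func1(data);
    else if (i == 1) func2(data);
}
```

The two dispatchers behave identically; the rewrite only trades an indirect jump for branches that a synthesis tool (or an inliner) can follow.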

Slide 33: Coding Guideline Results on H.264
Simple coding guidelines made a large improvement, and the rewritten software ran only ~3% slower than the original. Binary partitioning remained competitive with C partitioning: speedups of 6.55 for binary vs. 6.56 for C. The small difference was caused by switch statements that used indirect jumps.

Slide 34: Studied More Benchmarks, Developed More Guidelines
We studied the guidelines further on standard benchmarks and found further synthesis speedups (again, independent of the C vs. binary issue). More guidelines are to be developed.
Publications:
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale).
- Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.

Slide 35: Task Description
Warp processing background: two seed SRC CSR grants (2002-2005) showed feasibility. Idea: transparently move critical binary regions from the microprocessor to an FPGA → 10x performance/energy gains or more.
Task: mature warp technology.
- Year 1 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
- Years 2/3: reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

Slide 36: Warp-Tailored FPGA Prototype
- Developed an FPGA fabric tailored to fast, small-memory on-chip CAD (slide shows the fabric: DADG and LCH blocks, a configurable logic fabric of CLBs and switch matrices, a 32-bit MAC, and the LUT/routing structure).
- Building a chip prototype with Intel: created synthesizable VHDL models, now running through Intel's shuttle tool flow; plan to incorporate it with an ARM processor and other IP on a shuttle seat.
- Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing; tapeout scheduled for 2006 Q3.

Slide 37: Industrial Interactions
Freescale:
- Numerous phone conferences, emails, and reports on technical subjects.
- Co-authored paper (CODES/ISSS'05); another pending.
- Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin.
Intel:
- Three visits by the PI, and one by graduate student Roman Lysecky, to Intel Research in Santa Clara.
- PI presented at the Intel System Design Symposium, Nov. 2005.
- PI served on an Intel Research Silicon Prototyping Workshop panel, May 2005.
- Participating in Intel's Research Shuttle (chip prototype); bi-weekly phone conferences since summer 2005 involving the PI, Intel engineers, and Roman Lysecky (now a professor at the University of Arizona).
IBM:
- Embarking on studies of warp processing results on server applications.
- UCR group to receive a Cell-based prototyping platform (with Prof. Walid Najjar).
Also several interactions with Xilinx.

Slide 38: Task Description – Coming Up
Warp processing background: two seed SRC CSR grants (2002-2005) showed feasibility. Idea: transparently move critical binary regions from the microprocessor to an FPGA → 10x performance/energy gains or more.
Task: mature warp technology.
- Years 1/2 (in progress): automatic high-level construct recovery from binaries; in-depth case studies (with Freescale), during which we also discovered an unanticipated problem and developed a solution; warp-tailored FPGA prototype (with Intel).
- Years 2/3 (all three sub-tasks just now underway): reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

Slide 39: Recent Publications
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (co-authored with Freescale).
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, special issue "Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau," Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

