
1 Warp Processors
Frank Vahid (Task Leader), Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 - June 2008
Ph.D. students: Greg Stitt (Ph.D. expected June 2007), Ann Gordon-Ross (June 2007), David Sheldon (2009), Ryan Mannion (2009), Scott Sirowy (2010)
Industrial Liaisons: Brian W. Einloth (Motorola); Serge Rutman, Dave Clark, Darshan Patra (Intel); Jeff Welser, Scott Lekuch (IBM)

2 Task Description
Warp processing background. Idea: invisibly move binary regions from the microprocessor to the FPGA, for 10x speedups or more, plus energy gains.
Task: mature warp technology.
Years 1/2: automatic high-level construct recovery from binaries; in-depth case studies (with Freescale); warp-tailored FPGA prototype (with Intel).
Years 2/3: reduce the memory bottleneck using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM).

3 Microprocessors plus FPGAs
Examples: Xilinx Virtex-II Pro (source: Xilinx), Altera Excalibur (source: Altera), Cray XD1 (source: FPGA Journal, Apr. 2005).
Speedups of 10x-1000x across embedded, desktop, and supercomputing.
More platforms with a microprocessor and FPGA: Xilinx, Altera, ..., Cray, SGI, Mitrionics, IBM Cell (research).

4 "Traditional" Compilation for uP/FPGAs
Specialized language or compiler: SystemC, NapaC, Handel-C, Spark, ROCCC, CatapultC, Streams-C, DEFACTO, ...
Commercial success is still limited: software developers are reluctant to change languages/tools. But the approach is still very promising.
(Tool-flow diagram: high-level code in a specialized language goes through a specialized compiler; synthesis produces the FPGA bitstream for the hardware partition, while the linker combines libraries/object code into an updated binary for the microprocessor. A non-standard software tool flow.)

5 Warp Processing: "Invisible" Synthesis
2002: Sought to make synthesis more "invisible"; began the "Synthesis from Binaries" project. The key move is compilation before synthesis: a standard software tool flow (compiler and linker, with libraries/object code) produces the software binary, and decompilation plus synthesis then turn regions of that binary into an FPGA bitstream and an updated binary.

6 Warp Processing: Dynamic Synthesis
The circuits obtained from binaries were competitive. 2003: why not do it at runtime? Like binary translation (x86 to VLIW), but more aggressive.
Benefits: language/tool independence; library code is handled; binaries stay portable; dynamic optimizations become possible. The FPGA becomes transparent performance hardware, like memory: a warp processor looks like a standard microprocessor but invisibly synthesizes hardware.

7 Warp Processing Background: Basic Idea
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

8 Warp Processing Background: Basic Idea (continued)
Step 2: The microprocessor executes the instructions in the software binary.

9 Warp Processing Background: Basic Idea (continued)
Step 3: The profiler monitors instructions and detects critical regions in the binary (critical loop detected at the add/beq back edge).
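The profiling step above is done by small non-intrusive on-chip hardware. A minimal software sketch of the idea, counting taken backward branches (loop back edges) in a small direct-mapped table, might look as follows; the table size and threshold are illustrative assumptions, not values from the slides.

```c
#include <stdint.h>

#define TABLE_SIZE 16    /* hypothetical: entries in the branch table */
#define THRESHOLD  1000  /* hypothetical: count marking a loop "critical" */

static uint32_t counts[TABLE_SIZE];

/* Called (conceptually, by hardware) on every taken backward branch.
 * Returns 1 once the branch target's counter crosses the threshold,
 * i.e., a critical loop has been detected. */
int profile_backward_branch(uint32_t target_addr)
{
    uint32_t idx = (target_addr >> 2) & (TABLE_SIZE - 1); /* word-aligned PC */
    counts[idx]++;
    return counts[idx] >= THRESHOLD;
}
```

A real warp profiler implements this as a hardware module beside the microprocessor, so profiling adds no instruction overhead.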

10 Warp Processing Background: Basic Idea (continued)
Step 4: The on-chip CAD reads in the critical region.

11 Warp Processing Background: Basic Idea (continued)
Step 5: The on-chip CAD converts the critical region into a control/data flow graph (CDFG):
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
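For reference, the decompiled CDFG on this slide corresponds to the following C. The "<< 1" scales the index by a 2-byte element, so reg2 can be read as the base of a 10-element array of 16-bit values; the array type is inferred here, not stated on the slide.

```c
#include <stdint.h>

/* C view of the recovered CDFG: reg4 accumulates the sum of a
 * 10-element array of 16-bit values whose base address is in reg2. */
int32_t recovered_loop(const int16_t *reg2)
{
    int32_t reg4 = 0;
    for (int32_t reg3 = 0; reg3 < 10; reg3++)
        reg4 += reg2[reg3];   /* mem[reg2 + (reg3 << 1)] */
    return reg4;
}
```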

12 Warp Processing Background: Basic Idea (continued)
Step 6: The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit, e.g., an adder tree.

13 Warp Processing Background: Basic Idea (continued)
Step 7: The on-chip CAD maps the circuit onto the FPGA (CLBs and switch matrices).

14 Warp Processing Background: Basic Idea (continued)
Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more. The warped binary retains only the setup and return code:
Mov reg3, 0
Mov reg4, 0
loop: // instructions that interact with FPGA
Ret reg4
Feasible for repeating or long-running applications.

15 Task Description (outline repeated from slide 2)

16 Synthesis from Binaries Can Be Surprisingly Competitive
Only a small difference in speedup, given aggressive decompilation: previous techniques plus newly created ones.

17 Decompilation Is Effective Even with High Compiler-Optimization Levels
Do compiler optimizations generate binaries that are harder to decompile effectively? (Surprisingly) found the opposite: optimized code is even better. Average speedup over 10 examples.
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.

18 Task Description (outline repeated from slide 2)

19 Several-Month Study with Freescale
Optimized H.264 decoder: proprietary code, different from the reference code (10x faster, 16,000 lines). About 90% of execution time is spread over 45 distinct functions, rather than 2-3.

20 Several-Month Study with Freescale (continued)
Binary synthesis is competitive with high-level synthesis.
Publication: Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. CODES/ISSS, Sep. 2005.

21 However, the Ideal Speedup Is Much Larger
Large difference between the ideal speedup and the actual speedup. How to bring both approaches closer to ideal? An unanticipated sub-task.

22 C-Level Coding Guidelines
Are there simple coding guidelines that improve synthesized hardware? (Orthogonal to the synthesis-from-high-level-versus-binary issue.)
Studied dozens of embedded applications and identified bottlenecks: memory bandwidth, use of pointers, software algorithms.
Defined ~10 basic guidelines (e.g., avoid function pointers, use constants, ...), bringing results closer to ideal.
Publication: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2006.
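As an illustration of the flavor of such guidelines ("use constants", and avoiding opaque pointer traversal), here is a hypothetical before/after pair; the guideline names come from the slide, but the code itself is invented for this sketch.

```c
/* Before: pointer walk with a runtime-variable bound. A synthesis
 * tool has a hard time inferring the access pattern or unrolling. */
int sum_ptr(const int *p, int n)
{
    int s = 0;
    while (n--) s += *p++;
    return s;
}

/* After: constant bound and explicit indexing. The tool can see a
 * fixed-size, stride-1 array access and build a parallel datapath. */
#define N 64
int sum_idx(const int a[N])
{
    int s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}
```

Both functions compute the same sum; only the second exposes a synthesizable structure.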

23 Task Description (outline repeated from slide 2)

24 Warp-Tailored FPGA Prototype
One-year effort to develop an FPGA fabric tailored to fast, small-memory on-chip CAD. Bi-weekly phone meetings for 5 months, plus a several-day visit to Intel. Created synthesizable VHDL models in Intel's shuttle tool flow, in 0.13-micron technology, simulated and verified at post-layout. (Unfortunately, Intel cancelled the entire shuttle program just before our tapeout.)
(Diagram: configurable logic fabric with DADG, LCH, a 32-bit MAC, CLBs containing 4-input LUTs, switch matrices, and routing channels between adjacent CLBs.)

25 Task Description (outline repeated from slide 2)

26 Smart Buffers
State-of-the-art FPGA compilers use several advanced methods, e.g., ROCCC, the Riverside Optimizing Compiler for Configurable Computing [Guo, Buyukkurt, Najjar, LCTES 2004].
Smart buffer: the compiler analyzes memory access patterns, determines the window size and stride, and creates a custom self-updating buffer that "pushes" data into the datapath. Helps alleviate the memory bottleneck problem.
(Diagram: input block RAM and address generator feed the smart buffer, which feeds the datapath; results pass through a write buffer to the output block RAM and address generator, triggered per task.)

27 Smart Buffers (continued)
void fir() {
  for (int i = 0; i < 50; i++) {
    B[i] = C0*A[i] + C1*A[i+1] + C2*A[i+2] + C3*A[i+3];
  }
}
Iteration windows slide across A: the 1st iteration reads A[0..3], the 2nd A[1..4], the 3rd A[2..5], and so on. Only the newest element of each window (shown in bold on the slide) is read from memory; the rest are reused from the smart buffer, and the oldest buffered element is killed.
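The window reuse on this slide can be sketched in C with an explicit circular buffer: only one new element of A is fetched from memory per iteration, so a 4-tap FIR over n outputs performs n + 3 reads instead of 4n. The coefficient values here are illustrative, not from the slide.

```c
#define TAPS 4
static const int C[TAPS] = {1, 2, 3, 4};  /* hypothetical C0..C3 */

/* Computes B[0..n-1] from A[0..n+TAPS-2] using a TAPS-entry circular
 * window buffer, and returns the number of memory reads performed:
 * n + TAPS - 1 with the buffer, versus n * TAPS naively. */
int fir_smart_buffer(const int *A, int *B, int n)
{
    int win[TAPS];
    int reads = 0;
    for (int k = 0; k < TAPS - 1; k++) {  /* prime with A[0..TAPS-2] */
        win[k] = A[k];
        reads++;
    }
    for (int i = 0; i < n; i++) {
        win[(i + TAPS - 1) % TAPS] = A[i + TAPS - 1]; /* one new fetch;   */
        reads++;                                      /* overwrites the  */
        int acc = 0;                                  /* "killed" element */
        for (int t = 0; t < TAPS; t++)
            acc += C[t] * win[(i + t) % TAPS];
        B[i] = acc;
    }
    return reads;
}
```

The hardware smart buffer does the same bookkeeping with registers and an address generator rather than modulo arithmetic.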

28 Recovering Arrays from Binaries
Arrays and memory access patterns are needed. Array recovery from binaries: search loops for memory accesses with linear patterns (other access patterns, e.g., array[i*i], are possible but rare). Array bounds are determined from loop bounds and induction variables.

29 Recovery of Arrays
Example source: long array[10]; for (reg3 = 0; reg3 < 10; reg3++) reg4 += array[reg3];
Determine the induction variable: reg3. Find array address calculations: the element size is given by the shift or multiplication amount (here, reg3 << 2, for 4-byte elements). Find the base address from reg2's definition: reg2 corresponds to the array base address. Determine the array bounds from the loop bounds (reg3 = 0; reg3 < 10; reg3++).
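A toy version of this recovery step, assuming the analysis has already matched the address pattern base + (induction_variable << shift) inside a counted loop; the struct and function names are invented for the sketch.

```c
#include <stdint.h>

/* Result of recovering one array from a binary loop (names invented). */
typedef struct {
    uint32_t base;       /* array base address, from reg2's definition   */
    uint32_t elem_size;  /* 1 << shift amount in the address calculation */
    uint32_t length;     /* number of elements, from the loop bound      */
} recovered_array_t;

recovered_array_t recover_array(uint32_t base_reg_value,
                                uint32_t shift_amount,
                                uint32_t loop_bound)
{
    recovered_array_t a;
    a.base      = base_reg_value;
    a.elem_size = 1u << shift_amount;   /* e.g., << 2 means 4-byte longs */
    a.length    = loop_bound;
    return a;
}
```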

30 Recovery of Arrays (continued)
Multidimensional recovery is more difficult. Example: array[i][j] can be implemented many ways. In a doubly nested loop (for i = 0..9, for j = 0..9), the address may be computed as (i*element_size*width + j*element_size) + base, or as (i*element_size*width + base) + j*element_size, among other groupings.

31 Recovery of Arrays (continued)
Multidimensional array recovery uses heuristics to find row-major-ordering (RMO) calculations. Compilers can implement RMO in many ways, depending on the optimization potential of the application; it is hard to check every possible way, so we check for common possibilities. The bounds of each array dimension are determined from the bounds of the inner and outer loops. So far able to recover multidimensional arrays in all but one example; success with dozens of benchmarks.
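Two of the algebraic forms the RMO heuristics must match (the address expressions on slide 30) can be written out and checked to be equivalent; a compiler may hoist the base into the row term or keep it in the final add, so a recognizer must accept both.

```c
#include <stdint.h>

/* Form 1: row and column offsets combined first, base added last. */
uint32_t rmo_form1(uint32_t base, uint32_t i, uint32_t j,
                   uint32_t width, uint32_t esize)
{
    return (i * esize * width + j * esize) + base;
}

/* Form 2: base folded into the row term (as when hoisted out of the
 * inner loop), column offset added last. */
uint32_t rmo_form2(uint32_t base, uint32_t i, uint32_t j,
                   uint32_t width, uint32_t esize)
{
    return (i * esize * width + base) + j * esize;
}
```

Both compute the address of element [i][j] of a row-major array with `width` columns of `esize`-byte elements.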

32 Experimental Setup
Two experiments: (1) compare binary synthesis with and without smart buffers; (2) compare synthesis from the binary and from C-level source, both with smart buffers.
Used our UCR decompilation tool (30,000 lines of C code), which outputs decompiled C code. Tool flow: C code compiled with gcc -O1 to an ARM software binary; decompilation recovers C code; ROCCC generates the controller, smart buffers, and datapath netlist; synthesized with Xilinx tools for a Xilinx XC2V2000 FPGA.

33 Binary Synthesis with and without Smart Buffers
Used examples from past ROCCC work. The smart buffer yields significant speedups, showing the criticality of the memory bottleneck problem.

34 Synthesis from Binary versus from Original C
From C (ROCCC) versus from the binary (gcc -O1, decompile, ROCCC): nearly the same results; one example was even better from the binary (due to a gcc optimization). Area overhead is due to strength-reduced operators and extra registers.
Publication: Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.

35 Task Description (outline repeated from slide 2)

36 Domain-Specific FPGA
Question: to what extent can customizing the FPGA fabric impact delay and area? Relevant for FPGA fabrics forming part of an ASIC or SoC, for sub-circuits subject to change.
Used VPR (Versatile Place and Route) on Xilinx Spartan-like fabrics; varied LUT sizes, LUTs per CLB, and switch-matrix parameters. Pseudo-exhaustive exploration on 9 MCNC circuit benchmarks; Pareto points show interesting delay/area tradeoffs.
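The last step of such an exploration, filtering the evaluated fabric configurations down to the Pareto-optimal (delay, area) points, can be sketched as follows; the data layout and function names are invented for this sketch.

```c
/* One evaluated fabric configuration, reduced to its two metrics. */
typedef struct { double delay, area; } point_t;

/* Returns 1 if p is dominated: some q is no worse in both metrics
 * and strictly better in at least one. */
static int dominated(point_t p, const point_t *set, int n)
{
    for (int i = 0; i < n; i++)
        if (set[i].delay <= p.delay && set[i].area <= p.area &&
            (set[i].delay < p.delay || set[i].area < p.area))
            return 1;
    return 0;
}

/* Writes the Pareto-optimal points into out; returns their count. */
int pareto(const point_t *set, int n, point_t *out)
{
    int m = 0;
    for (int i = 0; i < n; i++)
        if (!dominated(set[i], set, n))
            out[m++] = set[i];
    return m;
}
```

Each surviving point is a fabric configuration where delay cannot be improved without paying in area, or vice versa.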

37 Domain-Specific FPGA (continued)
Compared the customized fabric to the best average fabric in three experiments: delay only, area only, and delay*area. Delay: up to 50% gain, at a cost in area. Area: up to 60% gain, plus delay benefits. The benefits are understated: the average is over the 9 benchmarks, not the larger set for which off-the-shelf FPGA fabrics are designed.

38 Task Description (outline repeated from slide 2)

39 Consider Desktop/Server Domains
Investigated warp processing for SPEC benchmarks, but found little speedup from hardware/software partitioning, due to data structures, file I/O, library functions, etc. Server benchmark: studied the Apache server; too disk-intensive, could not attain significant speedups. Multiprocessing benchmarks: a promising direction for warp processing.

40 Multiprocessing Platforms Running Multiple Threads: Use Warp Processing to Synthesize Thread Accelerators on the FPGA
Function a() creates 10 threads of b(): for (i = 0; i < 10; i++) createThread(b);
With two microprocessors, the OS can schedule only 2 threads; the remaining 8 are placed in the thread queue. The warp tools create custom accelerators for b(), and the OS then schedules 4 threads onto the custom accelerators.
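The scheduling scenario on this slide (10 threads of b(), 2 microprocessors, and then 4 synthesized accelerators) can be modeled as a toy calculation. The counts match the slide, but the code itself is a hypothetical sketch, not the warp scheduler.

```c
#define NUM_THREADS 10   /* threads of b() created by a(), per the slide */

typedef struct { int on_cpu, on_fpga, queued; } schedule_t;

/* Greedy placement: fill microprocessors first, then FPGA
 * accelerators, and queue whatever is left. */
schedule_t schedule(int cpus, int accelerators)
{
    schedule_t s = {0, 0, 0};
    int remaining = NUM_THREADS;
    s.on_cpu  = remaining < cpus ? remaining : cpus;
    remaining -= s.on_cpu;
    s.on_fpga = remaining < accelerators ? remaining : accelerators;
    remaining -= s.on_fpga;
    s.queued  = remaining;
    return s;
}
```

Before warp synthesis this gives 2 running and 8 queued; after 4 accelerators for b() appear, 6 threads run at once and only 4 remain queued.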

41 Multiprocessing Platforms Running Multiple Threads (continued)
The profiler detects the performance-critical loop in b(); the warp tools then create larger/faster accelerators for it.

42 Warp Processing to Synthesize Thread Accelerators on FPGA: Results
Multi-threaded warp is 120x faster than a 4-microprocessor (ARM) system. Created a simulation framework (>10,000 lines of code, plus SimpleScalar). Applications must be long-running (e.g., scientific apps running for days) or repeating for synthesis times to be acceptable.

43 Multiprocessor Warp Processing: Additional Benefits due to Custom Communication
A network-on-a-chip (NoC) provides communication between multiple cores. Problem: the best topology is application dependent (e.g., App1 may favor a bus, App2 a mesh).

44 Warp Processing: Custom Communication
Warp processing can dynamically choose the topology on the FPGA. Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign ("Amoebic Computing").

45 Warp Processing Enables the Expandable Logic Concept
Expandable RAM: the system detects RAM at startup and improves performance invisibly. Expandable Logic: the warp tools detect the amount of FPGA present and invisibly adapt the application to use less or more hardware. Planning a MICRO submission.

46 Expandable Logic: Results
Used our simulation framework; large speedups of 14x to 400x (on scientific apps). Different apps require different amounts of FPGA; expandable logic allows customization of a single platform: the user selects the required amount of FPGA, with no need to recompile or re-synthesize.

47 Current/Future: IBM's Cell and FPGAs
Investigating the use of FPGAs to supplement the Cell. Q: Can Cell-aware code be migrated to an FPGA for further speedups? Q: Can multithreaded, Cell-unaware code be compiled to a Cell/FPGA hybrid for better speedups than the Cell alone?

48 Current/Future: A Distribution Format for Clever FPGA Circuits?
Code written for a microprocessor doesn't always synthesize into the best circuit. Designers create clever circuits to implement algorithms (dozens of publications yearly, e.g., at FCCM). Can those algorithms be captured in a high-level format suitable for compilation to a variety of platforms, with a big FPGA, a small FPGA, or none at all? NSF project; overlaps with the SRC warp processing project.

49 Industrial Interactions, Years 2/3
Freescale: research visit by F. Vahid to Freescale, Chicago, Spring 2006 (talk and full-day research discussion with several engineers). Internships: Scott Sirowy, summer 2006 in Austin (also 2005).
Intel: chip prototype; participated in Intel's Research Shuttle to build a prototype warp FPGA fabric. Continued bi-weekly phone meetings with Intel engineers; visit to Intel by PI Vahid and R. Lysecky (now a professor at the University of Arizona); several-day visit to Intel by Lysecky to simulate the design, ready for tapeout. June 2006: Intel cancelled the entire shuttle program as part of larger cutbacks. Research discussions via email with liaison Darshan Patra (Oregon).
IBM: internship by Ryan Mannion, summer and fall 2006 in Yorktown Heights; Caleb Leak, summer 2007, being considered. Platform: IBM's Scott Lekuch and Kai Schleupen made a 2-day visit to UCR to set up a Cell development platform with FPGAs. Technical discussion: numerous ongoing email and phone interactions with S. Lekuch regarding our research on the Cell/FPGA platform.
Several interactions with Xilinx as well.

50 Patents
"Warp Processing" patent: filed with the USPTO in summer 2004; several office actions since; still pending. SRC has a non-exclusive, royalty-free license.

51 Years 1/2 Publications
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale.)
Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky, F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt, F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

52 Years 2/3 Publications
Binary Synthesis. G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2007 (to appear).
Integrated Coupling and Clock Frequency Assignment. S. Sirowy, F. Vahid. International Embedded Systems Symposium (IESS), 2007.
Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid, S. Lonardi. Design Automation and Test in Europe (DATE), 2007.
A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar. Design Automation and Test in Europe (DATE), 2007.
Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid. Design Automation and Test in Europe (DATE), 2007.
Clock-Frequency Partitioning for Multiple Clock Domains Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid.
Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D.M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D.M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July 2006.
Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.

