1 DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis*
Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292
* The DEFACTO project was funded by the Information Technology Office (ITO) of the Defense Advanced Research Projects Agency (DARPA) under contract #F30602-98-2-0113.

2 Outline
- Background & Motivation
- Part 1: Application Mapping Example
- Part 2: Design Space Exploration
- Part 3: Challenges for Future FPGAs
- Related Work
- Conclusion

3 DEFACTO Objectives & Goal
- Objectives: Automatically Map High-Level Applications to Field-Programmable Hardware (FPGAs); Explore Multiple Design Choices
- Goal: Make Reconfigurable Technology Accessible to the Average Programmer

4 What Are FPGAs?
- Key Concepts: Configurable Hardware; Reprogrammable (ms latency)
- Architecture:
  - Configurable Logic Blocks (CLBs): "universal" logic; some inputs/outputs latched
  - Passive routing network between CLBs
  - Memories, processor cores

5 Why Use FPGAs?
- Advantages over Application-Specific Integrated Circuits (ASICs):
  - Faster Time to Market
  - "Post-Silicon" Modification Possible
  - Reconfigurable, Possibly Even at Run-Time
- Advantages over General-Purpose Processors:
  - Application-Specific Customization (e.g., parallelism, small data widths, arithmetic, bandwidth)
- Disadvantages:
  - Slow (typical automatic design @ 25 MHz)
  - Low Density of Transistors

6 How to Program FPGAs?
- Hardware-Oriented Languages:
  - VHDL or Verilog
  - Very Low-Level Programming
- Commercial Tools (e.g., Monet™):
  - Choose an Implementation Based on User Constraints (Time and Space Trade-Off)
  - Provide Estimates for the Implementation
- Problem: Too Slow for Large, Complex Designs
  - Place-and-Route Can Take up to 8 Hours for Large Designs
  - Unclear What to Do When Things Go Wrong

7 Behavioral Synthesis Example

variable A is std_logic_vector(0..7) ...
X <= (A * B) - (C * D) + F

Three possible implementations of the same expression:
- 6 registers, 2 multipliers, 2 adders/subtractors, 1 (long) clock cycle
- 9 registers, 2 multipliers, 2 adders/subtractors, 2 (shorter) clock cycles
- 13 registers, 1 multiplier, 2 adders/subtractors, 3 (shorter) clock cycles

8 Synthesizing FPGA Designs: Status
- Technology Advances Have Led to Increasingly Large Parts: FPGAs now have Millions of "gates"
- Current Practice is to Handcode Designs for FPGAs in Structural VHDL:
  - Tedious and Error-Prone
  - Requires Weeks to Months Even for Fairly Simple Designs
- A Higher-Level Approach is Needed!

9 DEFACTO: Key Ideas
- Parallelizing Compiler Technology Complements Behavioral Synthesis:
  - Adjusts Parallelism and Data Reuse
  - Optimizes External Memory Accesses
- Design Space Exploration:
  - Evaluates and Compares Designs before Committing to Hardware
  - Improves Design-Time Efficiency (a form of Feedback-Directed Optimization)

10 Opportunities: Parallelism & Storage

Behavioral Synthesis:
- Optimizes Scalar Variables Only
- Works Inside the Loop Body
- Supports User-Controlled Loop Unrolling
- Manages Registers and Inter-Operator Communication
- Considers One FPGA
- Performs Allocation, Binding & Scheduling of Hardware

Parallelizing Compiler:
- Optimizes Scalars & Multi-Dimensional Arrays
- Works Inside the Loop Body & Across Iterations
- Analysis Guides Automatic Loop Transformations
- Evaluates Tradeoffs of Different Memories, On- and Off-Chip
- Takes a System-Level View
- No Knowledge of Hardware Implementation

11 Part 1: Mapping Complete Designs from C to FPGAs (Sobel Edge Detection Example)

12 Example: Sobel Edge Detection

char img[IMAGE_SIZE][IMAGE_SIZE], edge[IMAGE_SIZE][IMAGE_SIZE];
int uh1, uh2, threshold;
for (i = 0; i < IMAGE_SIZE - 4; i++) {
  for (j = 0; j < IMAGE_SIZE - 4; j++) {
    uh1 = ((-img[i][j]) + (-(2 * img[i+1][j])) + (-img[i+2][j]))
        + ((img[i][j-2]) + (2 * img[i+1][j-2]) + (img[i+2][j-2]));
    uh2 = ((-img[i][j]) + (img[i+2][j])) + (-(2 * img[i][j-1]))
        + (2 * img[i+2][j-1]) + ((-img[i][j-2]) + (img[i][j-2]));
    if ((abs(uh1) + abs(uh2)) < threshold)
      edge[i][j] = 0xFF;
    else
      edge[i][j] = 0x00;
  }
}

Sobel convolution masks (from the slide figure):
  -1 -2 -1      1 0 -1
   0  0  0      2 0 -2
   1  2  1      1 0 -1

13 Sobel: A Naïve Implementation
- Large Number of Adders and Multipliers (shifts in this case)
- Too Many Memory Accesses! 8 Reads and 1 Write per Iteration of the Loop
- Observation: Across 2 Iterations, 4 out of 8 Values Can Be Reused
(Slide figure: a datapath reading img[i][j] through img[i+2][j+2] and writing 0xFF or 0x00 to edge[i][j].)

14 Data Reuse Analysis: Sobel
(Slide figure: reuse graphs over the eight img references img[i][j] ... img[i+2][j+2]. Across outer-loop iterations the reuse distances are d = (1,0), (2,0) and (1,0); across inner-loop iterations they are d = (0,1), (0,2) and (0,1).)

15 Data Reuse Using Tapped-Delay Lines
- Reduce the Number of Memory Accesses
- Exploit Array Layout and Distribution: Packing, Stripping
- Examples (slide figure: two delay-line datapaths over img[i][j], img[i+1][j], img[i+2][j] and over img[i][j], img[i][j+1], img[i][j+2]):
  - Without packing: Accesses = 1.0 + 1.0 + 1.0 + 1.0 = 4.0
  - With packing: Accesses = 0.25 + 0.25 + 0.25 + 0.25 = 1.0

16 Overall Design Approach
- Application Data-Paths: Extract the Body of Loops; Uses Behavioral Synthesis
- Memory Interfaces: Uses Data Access Patterns to Generate Channel Specs; VHDL Library Templates
(Slide figure: a memory connected to two application data-paths.)

17 WildStar™: A Complex Memory Hierarchy
(Slide figure: board diagram with three FPGAs (FPGA 0, 1, 2), four Shared Memories (0-3), four SRAMs (0-3), and a PCI Controller to off-board memory, connected by 32-bit and 64-bit buses.)

18 Project Status
- Complex Infrastructure:
  - Different Programming Languages (C vs. VHDL)
  - Different EDA Tools, Different Vendors
  - Experimental Target, In-House Tools
- Combines Compiler Techniques and Behavioral Synthesis:
  - Different Execution Models
  - Reconciled Representation
- It Works!
  - Fully Automated for Single-FPGA Designs
  - Modest Manual Intervention for Multi-FPGA Designs (simulation OK)
(Slide figure: tool flow from Algorithm Description through Compiler Analysis (SUIF), Code Transformations and Annotations, Computation & Data Partitioning, Design Space Exploration, SUIF2VHDL, Behavioral Synthesis & Estimation (Monet), Logic Synthesis (Synplicity), Place & Route (Xilinx Foundations) and Memory Access Protocols, down to the Annapolis WildStar Board.)

19 Sobel on the Annapolis WildStar Board: Manual vs. Automated

Metric                      Manual          Automated
Space (slices)              2238            2279 (2% increase)
Cycles                      326K            518K
Clock Rate (MHz)            42              40
Execution Time (one frame)  7.7 ms (100%)   12.95 ms (159%)
Design Time                 about 1 week    42 minutes

20 Part 2: Design Space Exploration Using Behavioral Synthesis Estimates

21 Design Space Exploration (Current Practice)
Design Specification (Low-Level VHDL) → Logic Synthesis / Place & Route → Validation / Evaluation (Correct? Good design?) → Design Modification → repeat
- 2 Weeks for a Working Design
- 2 Months for an Optimized Design

22 Design Space Exploration (Our Approach)
Algorithm (C/Fortran) → Compiler Optimizations (SUIF): Unroll-and-Jam, Scalar Replacement, Custom Data Layout → SUIF2VHDL Translation → Behavioral Synthesis Estimation → Unroll Factor Selection → Logic Synthesis / Place & Route
- Overall, Less than 2 Hours
- 5 Minutes for Optimized Design Selection

23 Problem Statement
Trade-off: lower Execution Time (exploit parallelism, reuse data on chip) vs. higher Space Requirements (more copies of operators, more on-chip registers)
- Constraint: Size of Design Less than FPGA Capacity
- Goal: Minimal Execution Time
- Selection Criterion: For a Given Performance, Minimal Space
  - Frees up More Space for Other Computations
  - Better Clock Rate Achieved
  - Desirable to Use On-Chip Space Efficiently

24 Balance
- Definition: Balance = Data Fetch Rate / Data Consumption Rate
  - Data Fetch Rate [bits/cycle] = data bits required per computation time; limited by the FPGA's effective memory bandwidth
  - Data Consumption Rate [bits/cycle] = data bits consumed per computation time; limited by the data dependences of the computation
- If Balance > 1, Compute Bound
- If Balance < 1, Memory Bound
- Balance suggests whether more resources should be devoted to computation or to storage.

25 Loop Unrolling
- Exposes Fine-Grain Parallelism by Replicating the Loop Body.

Original loop:
  DO I = 1, N
    A(I) = A(I-2) + B(I)

Unrolled by a factor of 2:
  DO I = 1, N, 2
    A(I)   = A(I-2) + B(I)
    A(I+1) = A(I-1) + B(I+1)

- As the Unrolling Factor Increases, Both the Data Fetch and Consumption Rates Increase.
(Slide figure: dataflow graphs of the original and unrolled loop bodies.)

26 Monotonicity Properties
(Slide figure: Data Fetch Rate (bits/cycle), Data Consumption Rate (bits/cycle) and Balance (= Fetch/Consumption) plotted against the unroll factor, each with its saturation point marked.)
Saturation point: the unroll factor that saturates memory bandwidth for a given architecture.

27 Balance & the Optimal Unroll Factor
(Slide figure: data fetch and data consumption rates (bits/cycle) plotted against the unroll factor for design points 1-5; the fetch rate saturates at the maximum memory bandwidth, and the optimal solution lies at the saturation point, where the consumption rate meets it.)
Balance Guides the Design Space Exploration.

28 Experiments
- Multimedia Kernels:
  - FIR (Finite Impulse Response)
  - Matrix Multiply
  - Sobel (Edge Detection)
  - Pattern Matching
  - Jacobi (Five-Point Stencil)
- Methodology:
  - Compiler Translates C to SUIF and Behavioral VHDL
  - Synthesis Tool Estimates Space and Computational Latency
  - Compiler Computes Balance and Execution Time, Accounting for Memory Latency
- Memory Latency:
  - Pipelined: 1 cycle for read and write
  - Non-Pipelined: 7 cycles for read and 3 cycles for write

29 FIR
(Slide figure: FIR datapath of adders and multipliers, with design points for outer-loop unroll factors 1, 2, 4, 8, 16, 32 and 64; the selected design achieves a speedup of 17.26.)

30 Matrix Multiply
(Slide figure: matrix-multiply datapath of adders and multipliers, with design points for outer-loop unroll factors 1, 2, 4, 8, 16 and 32; the selected design achieves a speedup of 13.36.)

31 Efficiency of Design Space Exploration

Program          Search Space  Searched Points
FIR              2048 (24)     3
Matrix Multiply  2048 (24)     4
Jacobi           512 (27)      5
Pattern          768 (20)      4
Sobel            2048 (24)     2

- On Average, Only 0.3% (15%) of the Space is Searched.

32 FIR: Estimation vs. Accurate Data
- Larger Designs Lead to Degradation in Clock Rates
- The Compiler Can Use a Statistical Approach to Derive Confidence Intervals for Space
- In Our Case, the Compiler Makes the Correct Decision Using Imperfect Data

33 Part 3: Challenges for Future FPGAs (Heterogeneous Functional and Storage Resources; Data/Computation Partitioning and Scheduling Revisited)

34 Field-Programmable Core Arrays
- Large Number of Transistors:
  - Multiple Application-Specific Cores
  - Customization of Interconnect
  - Other Specialized Logic
- Challenges:
  - Data Partitioning: Custom Storage Structures; Allocation, Binding and Scheduling; Replication and Reorganization
  - Computation Partitioning: Scheduling between Cores; Coarse-Grain Pipelining
  - Revisiting Issues with Parallelizing Compiler Technology
(Slide figure: a die combining IP cores, a DSP, an ARM core, SRAM and DRAM blocks.)

35 Related Work
- Compilers for Special-Purpose Configurable Architectures: PipeRench (CMU), RaPiD (UW), RAW (MIT)
- High-Level Languages Oriented towards Hardware: Handel-C, Cameron (CSU), PICO (HP), Napa-C (LANL)
- Integrated Compiler and Logic Synthesis: Babb (MIT), Nimble (Synopsys)
- Compiling from MatLab to FPGAs: Match Compiler (Northwestern)

36 Conclusion
- Combines Behavioral Synthesis and Parallelizing Compiler Technologies
- Fast & Automated Design Space Exploration:
  - Trades Space for Functional Units via Loop Unrolling
  - Uses Balance and Monotonicity Properties
  - Searches Only 0.3% of the Entire Design Space
  - Near-Optimal Performance and Smallest Space
- Future FPGAs:
  - Coarser-Grained, Custom Functional and Storage Structures
  - Multiprocessor on a Chip
  - Data and Computation Partitioning and Coarse-Grain Scheduling

