DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis*

Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, California

* The DEFACTO project was funded by the Information Technology Office (ITO) of the Defense Advanced Research Projects Agency (DARPA) under contract #F
1  Outline
- Background & Motivation
- Part 1: Application Mapping Example
- Part 2: Design Space Exploration
- Part 3: Challenges for Future FPGAs
- Related Work
- Conclusion
2  DEFACTO Objective & Goals
- Objectives:
  - Automatically Map High-Level Applications to Field-Programmable Hardware (FPGAs)
  - Explore Multiple Design Choices
- Goal: Make Reconfigurable Technology Accessible to the Average Programmer
3  What Are FPGAs? Key Concepts
- Configurable Hardware
  - Reprogrammable (ms latency)
- Architecture
  - Configurable Logic Blocks (CLBs): "universal" logic, some inputs/outputs latched
  - Passive routing network between CLBs
  - Memories, processor cores
4  Why Use FPGAs?
- Advantages over Application-Specific Integrated Circuits (ASICs)
  - Faster Time to Market
  - "Post-Silicon" Modification Possible
  - Reconfigurable, Possibly Even at Run Time
- Advantages over General-Purpose Processors
  - Application-Specific Customization (e.g., parallelism, small data widths, arithmetic, bandwidth)
- Disadvantages
  - Slow (typical automatic ...)
  - Low Density of Transistors
5  How to Program FPGAs?
- Hardware-Oriented Languages
  - VHDL or Verilog: Very Low-Level Programming
- Commercial Tools (e.g., Monet™)
  - Choose Implementation Based on User Constraints (Time and Space Trade-Off)
  - Provide Estimates for Implementations
- Problem: Too Slow for Large, Complex Designs
  - Place-and-Route Can Take up to 8 Hours for Large Designs
  - Unclear What to Do When Things Go Wrong
6  Behavioral Synthesis Example

  variable A : std_logic_vector(0 to 7);
  ...
  X <= (A * B) - (C * D) + F;

Three candidate implementations:
- 6 Registers, 2 Multipliers, 2 Adders/Subtractors, 1 (long) clock cycle
- 9 Registers, 2 Multipliers, 2 Adders/Subtractors, 2 (shorter) clock cycles
- 13 Registers, 1 Multiplier, 2 Adders/Subtractors, 3 (shorter) clock cycles
7  Synthesizing FPGA Designs: Status
- Technology Advances Have Led to Increasingly Large Parts
  - FPGAs Now Have Millions of "Gates"
- Current Practice Is to Hand-Code Designs for FPGAs in Structural VHDL
  - Tedious and Error-Prone
  - Requires Weeks to Months Even for Fairly Simple Designs
- Higher-Level Approach Needed!
8  DEFACTO: Key Ideas
- Parallelizing Compiler Technology Complements Behavioral Synthesis
  - Adjusts Parallelism and Data Reuse
  - Optimizes External Memory Accesses
- Design Space Exploration
  - Evaluates and Compares Designs Before Committing to Hardware
  - Improves Design-Time Efficiency
  - A Form of Feedback-Directed Optimization
9  Opportunities: Parallelism & Storage

Behavioral Synthesis                                  | Parallelizing Compiler
------------------------------------------------------|------------------------------------------------------
Optimizations: Scalar Variables Only                  | Optimizations: Scalars & Multi-Dimensional Arrays
Inside Loop Body                                      | Inside Loop Body & Across Iterations
Supports User-Controlled Loop Unrolling               | Analysis Guides Automatic Loop Transformations
Manages Registers and Inter-Operator Communication    | Evaluates Tradeoffs of Different Memories, On- and Off-Chip
Considers One FPGA                                    | System-Level View
Performs Allocation, Binding & Scheduling of Hardware | No Knowledge of Hardware Implementation
Part 1: Mapping Complete Designs from C to FPGAs
Sobel Edge Detection Example
12  Sobel - A Naive Implementation
- Large Number of Adders and Multipliers (Shifts in This Case)
- Too Many Memory Accesses!
  - 8 Reads and 1 Write per Iteration of the Loop
  - Reads: img[i][j], img[i][j+1], img[i][j+2], img[i+1][j], img[i+1][j+2], img[i+2][j], img[i+2][j+1], img[i+2][j+2]; Write: edge[i][j]
- Observation: Across 2 Iterations, 4 out of 8 Values Can Be Reused
13  Data Reuse Analysis - Sobel
[Figure: reuse distance vectors among the eight img references]
- Reuse along rows (i direction): distance vectors d = (1,0) and d = (2,0), e.g., img[i][j] reused by img[i+1][j] and img[i+2][j]
- Reuse along columns (j direction): distance vectors d = (0,1) and d = (0,2), e.g., img[i][j] reused by img[i][j+1] and img[i][j+2]
14  Data Reuse Using Tapped-Delay Lines
- Reduce the Number of Memory Accesses
- Exploit Array Layout and Distribution
  - Packing
  - Stripping
- Examples: delay lines along a column (img[i][j], img[i+1][j], img[i+2][j]) and along a row (img[i][j], img[i][j+1], img[i][j+2])
  - Accesses per iteration: 1.0 vs. 4.0 for the two layouts
15  Overall Design Approach
[Diagram: external MEMs connect through memory interfaces to the application data-paths]
- Extract Body of Loops
- Application Data-Paths Use Behavioral Synthesis
- Memory Interfaces Use Data Access Patterns to Generate Channel Specs
- VHDL Library Templates
16  WildStar™: A Complex Memory Hierarchy
[Diagram: FPGA 0, FPGA 1, and FPGA 2 connected to Shared Memories 0-3 and SRAMs 0-3 over 32-bit and 64-bit paths, plus a PCI controller to off-board memory]
17  Project Status
- Complex Infrastructure
  - Different Programming Languages (C vs. VHDL)
  - Different EDA Tools, Different Vendors
  - Experimental Target, In-House Tools
- Combines Compiler Techniques and Behavioral Synthesis
  - Different Execution Models; Representations Must Be Reconciled
- It Works!
  - Fully Automated for Single-FPGA Designs
  - Modest Manual Intervention for Multi-FPGA Designs (Simulation OK)
[Tool-flow diagram: Algorithm Description → Compiler Analysis → Code Transformations and Annotations → Computation & Data Partitioning → Design Space Exploration → SUIF2VHDL → Behavioral Synthesis & Estimation (Monet) → Logic Synthesis (Synplicity) → Place & Route (Xilinx Foundations) → Annapolis WildStar Board, with Memory Access Protocols]
18  Sobel on the Annapolis WildStar Board
[Figure: input image and output image]

Metric                     | Manual        | Automated
---------------------------|---------------|------------------
Space (slices)             |               | (2% increase)
Cycles                     | 326K          | 518K
Clock Rate (MHz)           | 42            | 40
Execution Time (one frame) | 7.7 ms (100%) | 12.95 ms (159%)
Design Time                | about 1 week  | 42 minutes
Part 2: Design Space Exploration Using Behavioral Synthesis Estimates
20  Design Space Exploration (Current Practice)
- Flow: Design Specification (Low-Level VHDL) → Logic Synthesis / Place & Route → Validation / Evaluation (Correct? Good Design?) → Design Modification → repeat
- 2 Weeks for a Working Design
- 2 Months for an Optimized Design
21  Design Space Exploration (Our Approach)
- Flow: Algorithm (C/Fortran) → Compiler Optimizations (SUIF): Unroll-and-Jam, Scalar Replacement, Custom Data Layout → SUIF2VHDL Translation → Behavioral Synthesis Estimation → Unroll Factor Selection → Logic Synthesis / Place & Route
- Overall, Less than 2 Hours
- 5 Minutes for Optimized Design Selection
22  Problem Statement
- Execution Time: Exploit Parallelism, Reuse Data On-Chip
- Space Requirements: More Copies of Operators, More On-Chip Registers
- Constraint: Size of Design Less than FPGA Capacity
- Goal: Minimal Execution Time
- Selection Criterion: For Given Performance, Minimal Space
  - Frees up More Space for Other Computations
  - Better Clock Rate Achieved
  - Desirable to Use On-Chip Space Efficiently
23  Balance

  Balance = Data Fetch Rate / Data Consumption Rate

- Consumption Rate [bits/cycle]: Data Bits Consumed per Unit of Computation Time
  - Limited by the Data Dependences of the Computation
- Data Fetch Rate [bits/cycle]: Data Bits Required per Unit of Computation Time
  - Limited by the FPGA's Effective Memory Bandwidth
- If Balance > 1, Compute Bound; If Balance < 1, Memory Bound
- Balance Suggests Whether More Resources Should Be Devoted to Computation or to Storage
24  Loop Unrolling
- Exposes Fine-Grain Parallelism by Replicating the Loop Body

  Original:                  Unrolled by 2:
  DO I = 1, N                DO I = 1, N, 2
    A(I) = A(I-2) + B(I)       A(I)   = A(I-2) + B(I)
  ENDDO                        A(I+1) = A(I-1) + B(I+1)
                             ENDDO

- As the Unroll Factor Increases, Both the Data Fetch and Consumption Rates Increase
25  Monotonicity Properties
[Plots: data fetch rate and data consumption rate (bits/cycle) vs. unroll factor, each increasing up to a saturation point; balance (= fetch/consumption) vs. unroll factor]
- Saturation Point: the Unroll Factor that Saturates Memory Bandwidth for a Given Architecture
26  Balance & Optimal Unroll Factor
[Plot: data fetch rate and data consumption rate (bits/cycle) vs. unroll factor; the fetch rate saturates at the maximum, and the optimal solution lies where the consumption rate meets it]
- Balance Guides the Design Space Exploration
27  Experiments
- Multimedia Kernels
  - FIR (Finite Impulse Response)
  - Matrix Multiply
  - Sobel (Edge Detection)
  - Pattern Matching
  - Jacobi (Five-Point Stencil)
- Methodology
  - Compiler Translates C to SUIF and Behavioral VHDL
  - Synthesis Tool Estimates Space and Computational Latency
  - Compiler Computes Balance and Execution Time, Accounting for Memory Latency
- Memory Latency
  - Pipelined: 1 Cycle for Read and Write
  - Non-Pipelined: 7 Cycles for Read, 3 Cycles for Write
30  Efficiency of Design Space Exploration

Program         | Search Space | Searched Points
----------------|--------------|----------------
FIR             | 2048 (24)    | 3
Matrix Multiply | 2048 (24)    | 4
Jacobi          | 512 (27)     | 5
Pattern         | 768 (20)     | 4
Sobel           | 2048 (24)    | 2

On average, only 0.3% (15%) of the space is searched.
31  FIR: Estimation vs. Accurate Data
- Larger Designs Lead to Degradation in Clock Rates
- The Compiler Can Use a Statistical Approach to Derive Confidence Intervals for Space
- In Our Case, the Compiler Makes the Correct Decision Using Imperfect Data
Part 3: Challenges for Future FPGAs
- Heterogeneous Functional and Storage Resources
- Data/Computation Partitioning and Scheduling Revisited
33  Field-Programmable Core Arrays
- Large Number of Transistors
  - Multiple Application-Specific Cores (e.g., ARM, DSP, IP cores, SRAM, DRAM)
  - Customization of Interconnect
  - Other Specialized Logic
- Challenges:
  - Data Partitioning: Custom Storage Structures; Allocation, Binding and Scheduling; Replication and Reorganization
  - Computation Partitioning: Scheduling Between Cores; Coarse-Grain Pipelining
  - Revisiting These Issues with Parallelizing Compiler Technology
34  Related Work
- Compilers for Special-Purpose Configurable Architectures: PipeRench (CMU), RaPiD (UW), RAW (MIT)
- High-Level Languages Oriented Towards Hardware: Handel-C, Cameron (CSU), PICO (HP), Napa-C (LANL)
- Integrated Compiler and Logic Synthesis: Babb (MIT), Nimble (Synopsys)
- Compiling from MATLAB to FPGAs: Match Compiler (Northwestern)
35  Conclusion
- Combines Behavioral Synthesis and Parallelizing Compiler Technologies
- Fast, Automated Design Space Exploration
  - Trades Space for Functional Units via Loop Unrolling
  - Uses Balance and Monotonicity Properties
  - Searches Only 0.3% of the Entire Design Space
  - Near-Optimal Performance at the Smallest Space
- Future FPGAs
  - Coarser-Grained, Custom Functional and Storage Structures
  - Multiprocessor on a Chip
  - Data and Computation Partitioning; Coarse-Grain Scheduling
36  Thank You