Lecture 16: Accelerator Design in the XUP Board. ECE 412: Microcomputer Laboratory, Fall 2006.


1 Lecture 16: Accelerator Design in the XUP Board
ECE 412: Microcomputer Laboratory, Fall 2006

2 Objectives
Understand accelerator design considerations in a practical FPGA environment
Gain knowledge of the XUP platform details required for efficient accelerator design

3 Four Fundamental Models of Accelerator Design
[Figure labels: Base; No OS service (in simple embedded systems); OS service; Accelerator as user-space mmapped I/O device; Virtualized device with OS scheduling support]

4 Hybrid Hardware/Software Execution Model
[Figure labels: CPU; FPGA accelerators; memory; devices; Linux OS; Linker/Loader; Application; DLL; OS modules; Resource manager; Compiler analysis/transformations; Synthesis; Soft object; Hard object; Source code (user-level function or device driver); Human-designed hardware; Compile time; User runtime; Kernel runtime]
Hardware accelerator as a DLL
  – Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
  – The DLL approach enables transparent interchange of software and hardware components
Application-level execution model
  – Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
  – FPGA bitmaps serve as the hardware counterpart to existing software modules
  – The same dynamic linking library interfaces and stubs apply to both software and hardware implementations
OS resource management
  – Services (API) for allocation, partial reconfiguration, saving and restoring the status, and monitoring
  – The multiprogramming scheduler can prefetch hardware accelerators in time for their next use
  – Access to the new hardware is controlled to ensure trust under private or shared use
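The idea that one library interface can front either a software or a hardware implementation can be sketched in a few lines. The following is only an illustration, not the lecture's mechanism: `AcceleratorStub`, `clip_software`, and `clip_hardware_sim` are hypothetical names, the "hardware" path is simulated in software, and the clipping stage stands in for the real work.

```python
from typing import Callable, List, Optional

Samples = List[int]
Impl = Callable[[Samples], Samples]


def clip_software(samples: Samples) -> Samples:
    """Software implementation: clip each sample to the signed 16-bit range."""
    return [max(-32768, min(32767, s)) for s in samples]


def clip_hardware_sim(samples: Samples) -> Samples:
    """Stand-in for the hardware path (hypothetical).

    A real stub would hand the samples to the FPGA; here the same
    result is computed in software so the sketch stays runnable.
    """
    return [max(-32768, min(32767, s)) for s in samples]


class AcceleratorStub:
    """Callers see one entry point; the binding behind it can change."""

    def __init__(self, software: Impl, hardware: Optional[Impl] = None) -> None:
        self._software = software
        self._hardware = hardware

    def bind_hardware(self, hardware: Optional[Impl]) -> None:
        # Passing None falls back to the software implementation.
        self._hardware = hardware

    def __call__(self, samples: Samples) -> Samples:
        impl = self._hardware if self._hardware is not None else self._software
        return impl(samples)


stub = AcceleratorStub(clip_software)
software_result = stub([40000, -5, -40000])

stub.bind_hardware(clip_hardware_sim)
hardware_result = stub([40000, -5, -40000])
```

The application code calling `stub(...)` does not change when the binding changes, which is the property the DLL approach relies on.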

5 MP3 Decoder: Madplay Library Dithering as a DLL
Madplay shared-library dithering function as software and FPGA DLL
  – Software profiling shows Audio_linear_dither() accounts for 97% of application time
  – The DL (dynamic linker) can switch the call to the hardware or software implementation
Used by ~100 video and audio applications
[Figure labels: Application; Sound driver; AC'97; OS; FPGA; Stub; DL; Decode MP3 block; Read sample; Write sample; Software dithering DLL and Hardware dithering DLL, each with stages Quantization, Clipping, Dithering, Random generator, Biasing, Noise shaping; Hardware dithering: 6 cycles]
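The 97% profile figure bounds what accelerating the dithering function alone can achieve. A minimal sketch using Amdahl's law follows; the kernel speedup of 10 in the example is an assumed value for illustration, not a number from the lecture.

```python
import math


def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when a fraction of run time is sped up by kernel_speedup."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on overall speedup as the kernel speedup grows without limit."""
    if accelerated_fraction >= 1.0:
        return math.inf
    return 1.0 / (1.0 - accelerated_fraction)


# 97% of application time is spent in the dithering function (from the slide).
limit = amdahl_limit(0.97)            # about 33.3x at best
example = amdahl_speedup(0.97, 10.0)  # about 7.9x for an assumed 10x kernel speedup
```

This ignores any invocation or data-transfer overhead, so real gains would be lower than these figures.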

6 CPU-Accelerator Interconnect Options
PLB (Processor Local Bus)
  – Wide transfers: 64 bits
  – Access to DRAM channel
  – Runs at 1/3 of the CPU frequency
  – Big penalty if the bus is busy during the first attempt to access it
OCM (On-Chip Memory) interconnect
  – Narrower: 32 bits
  – No direct access to DRAM channel
  – Runs at the CPU clock frequency
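The width and clock figures above can be combined into a rough peak-bandwidth comparison. This sketch assumes one transfer per bus clock and a 300 MHz CPU clock; both are assumptions for illustration, and arbitration, bursts, and wait states are ignored.

```python
def peak_bandwidth_bytes_per_s(width_bits: int, bus_clock_hz: float) -> float:
    """Peak bandwidth assuming one full-width transfer per bus clock (assumption)."""
    return (width_bits / 8) * bus_clock_hz


CPU_HZ = 300e6  # assumed CPU clock, not stated on the slide

plb_peak = peak_bandwidth_bytes_per_s(64, CPU_HZ / 3)  # 64 bits at 1/3 CPU clock
ocm_peak = peak_bandwidth_bytes_per_s(32, CPU_HZ)      # 32 bits at CPU clock

# The ratio does not depend on the assumed CPU clock: (4 * f) / (8 * f / 3) = 1.5
ocm_to_plb_ratio = ocm_peak / plb_peak
```

Under these assumptions the narrower OCM link has 1.5 times the raw peak bandwidth of the PLB, while only the PLB reaches the DRAM channel.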

7 Motion Estimation Design & Experience
Significant overhead in mmap and open calls
  – This arrangement can only support accelerators that will be invoked many times
  – Notice the dramatic reduction in computation time
  – Notice the large overhead in data marshalling and write
Full Search gives 10% better compression
  – Diamond Search is sequential and not suitable for acceleration
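Why a one-time setup cost only pays off after many invocations can be seen with a simple cost model. This is a sketch under assumed numbers: the function name and all timings in the example are hypothetical, not measurements from the lecture.

```python
import math
from typing import Optional


def break_even_calls(setup_us: float, sw_call_us: float,
                     hw_call_us: float, marshal_us: float) -> Optional[int]:
    """Smallest call count at which the accelerated path is no slower overall.

    Software total:    n * sw_call_us
    Accelerated total: setup_us + n * (hw_call_us + marshal_us)
    Returns None when each accelerated call is not cheaper than software,
    because the one-time setup cost can then never be recovered.
    """
    saving_per_call = sw_call_us - (hw_call_us + marshal_us)
    if saving_per_call <= 0:
        return None
    return max(1, math.ceil(setup_us / saving_per_call))


# Hypothetical timings in microseconds: 5000 for open/mmap setup, 100 per
# software call, 10 per hardware call plus 40 for data marshalling.
calls_needed = break_even_calls(5000, 100, 10, 40)

# If marshalling eats the whole gain, acceleration never pays off.
never = break_even_calls(5000, 100, 10, 95)
```

With these assumed numbers the saving is 50 microseconds per call, so the setup cost is recovered after 100 calls.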

8 JPEG: An Example
[Figure: JPEG pipeline. Original image (RGB) → RGB to YUV → Downsample (Y, U, V) → 2D Discrete Cosine Transform (DCT) → Quantization (QUANT) → Run-Length Encoding (RLE) → Huffman Coding (HC) → Compressed image. Annotations: Parallel execution on independent blocks; Inherently sequential region; Implemented as reconfigurable logic; Accelerator candidate]

9 JPEG Accelerator Design & Experience
Based on Model (d)
  – System call overhead for each invocation
  – Better protection
DCT and Quant are accelerated
  – Data flows directly from DCT to Quant
Data copy to user DMA buffer dominates cost
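For reference, the two accelerated stages can be written as a small software model in which DCT output feeds quantization directly. This is a naive sketch, not the accelerator's design: it uses a single uniform quantization step (an assumption; JPEG uses a per-coefficient table) and omits the usual level shift.

```python
import math
from typing import List

N = 8  # JPEG operates on 8x8 blocks
Block = List[List[float]]


def dct_2d(block: Block) -> Block:
    """Naive 2D DCT-II of an 8x8 block with the usual JPEG scaling."""
    def c(k: int) -> float:
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def quantize(coeffs: Block, step: float) -> List[List[int]]:
    """Divide each coefficient by a uniform step and round to an integer."""
    return [[round(value / step) for value in row] for row in coeffs]


def dct_quant(block: Block, step: float) -> List[List[int]]:
    """DCT followed immediately by quantization, with no intermediate hand-off."""
    return quantize(dct_2d(block), step)


# A flat block has only a DC component: 0.25 * 0.5 * (64 * 8) = 64, and 64 / 16 = 4.
flat = [[8.0] * N for _ in range(N)]
quantized = dct_quant(flat, 16.0)
```

Each 8x8 block is independent of the others, which is what makes this stage a candidate for parallel hardware.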

10 Execution Flow of DCT System Call
[Figure: timeline with three columns (Application, Operating System, Hardware); time runs downward]
Application:
    open(/dev/accel);   /* only once */
    …
    /* construct macroblocks */
    macroblock = …
    syscall(&macroblock, num_blocks)
    …
    /* macroblock now has transformed data */
    …
Operating system (labels as they appear in the figure): Data copy; Flush cache range; Setup DMA transfer; Poll DMA controller; Setup DMA transfer; Invalidate cache range; Data copy; Enable accelerator access for application
Hardware (labels): PPC; Memory; PLB; DMA controller; DCR; Accelerator (executing)
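The kernel-side sequence in the figure can be mimicked by a small simulation that records each step. This is a sketch only: `MockAccelDriver` is a hypothetical class, the step order is read off the figure labels, and nothing here touches real caches, DMA, or hardware.

```python
from typing import Callable, List


class MockAccelDriver:
    """Simulates one accelerated system call and records the steps taken."""

    def __init__(self, kernel: Callable[[List[int]], List[int]]) -> None:
        self.kernel = kernel          # stand-in for the accelerator's function
        self.trace: List[str] = []    # human-readable log of the steps

    def call(self, user_data: List[int]) -> List[int]:
        # 1. Copy the caller's data into a (simulated) DMA buffer.
        dma_buffer = list(user_data)
        self.trace.append("copy user data to DMA buffer")
        # 2. Make sure memory holds the data before DMA reads it.
        self.trace.append("flush cache range")
        # 3-4. Start the transfer to the accelerator and wait for it.
        self.trace.append("set up DMA transfer to accelerator")
        self.trace.append("poll DMA controller")
        # 5. The accelerator runs (simulated by a plain function call).
        result_buffer = self.kernel(dma_buffer)
        self.trace.append("accelerator executes")
        # 6-7. Bring results back and discard stale cache lines.
        self.trace.append("set up DMA transfer back to memory")
        self.trace.append("invalidate cache range")
        # 8. Copy the results into the caller's buffer.
        self.trace.append("copy results back to user buffer")
        return list(result_buffer)


driver = MockAccelDriver(lambda buf: [2 * v for v in buf])
output = driver.call([1, 2, 3])
steps = driver.trace
```

Two of the eight recorded steps are plain data copies, which is the part of the flow the lecture identifies as dominating the cost.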

11 Software Versus Hardware Acceleration
Overhead is a major issue!

12 Device Driver Access Cost

