Image Processing With FPGAs Zach Fuchs Sarit Patel EEL6935 14 April 2008.


1 Image Processing With FPGAs Zach Fuchs Sarit Patel EEL6935 14 April 2008

2 FPGA-Based Configurable Systolic Architecture for Window-Based Image Processing Authors: César Torres-Huitzil Miguel Arias-Estrada

3 Introduction Image processing is a fundamental step in modern machine vision systems. Many complex algorithms use lower-level results to pursue higher-level goals. –e.g.: edge detection to determine an object. Real-time performance is usually required in video applications.

4 Difficulty Building Systems Most computer vision applications are computationally intensive –Sequential nature of conventional processors slows down performance. Different computations in processing limit parallelization. Real-time performance is required.

5 Sample Applications Robotics Multimedia Virtual reality Industrial inspection Medical engineering Autonomous navigation

6 Goals of Paper Design 2D systolic architecture for window-based image processing. Consider design issues: –Flexibility –Silicon area –Power consumption –Performance

7 Window-Based Image Processing Large number of repetitive neighbor operations over image data Area of w x w pixels extracted from image Transformed according to window mask and mathematical functions Produce single, new output according to transform

8 Window-Based Image Processing [Figure: window-based image processing, with parts labeled 1, 2, 3]

9 Window-Based Operators Same scalar function applied on a pixel by pixel basis Scalar functions –e.g.: relational, arithmetic, logical, look up tables Reduction functions –Reduce window of results from scalar function to one output –e.g.: accumulation, maximum, absolute value
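
The split into a scalar function and a reduction function can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the authors' hardware design; the name `window_operator` and the test image are assumptions made for the example, and only positions where the window fits entirely inside the image are computed.

```python
def window_operator(image, mask, scalar_fn, reduce_fn):
    """Apply a w x w window operator at every position where the window fits.

    scalar_fn(pixel, coefficient) is applied pixel by pixel;
    reduce_fn(list_of_results) collapses the window to one output value.
    """
    w = len(mask)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - w + 1):
        out_row = []
        for c in range(cols - w + 1):
            values = [scalar_fn(image[r + i][c + j], mask[i][j])
                      for i in range(w) for j in range(w)]
            out_row.append(reduce_fn(values))
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Arithmetic scalar function (multiply) with an accumulation reduction.
sums = window_operator(image, ones, lambda p, m: p * m, sum)
# Pass-through scalar function with a maximum reduction.
maxes = window_operator(image, ones, lambda p, m: p, max)
```

Swapping the two function arguments changes the operator without touching the window traversal, which is the kind of flexibility the slide describes.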

10 Computational Requirements Window-based operations are computationally expensive tasks Focusing on convolution –Convolution - the amount of overlap between f and a reversed and translated version of g In general, complexity = O(w^2 x M x N) –w x w window mask –M x N image
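
A minimal sketch of direct 2D convolution makes the O(w^2 x M x N) cost concrete. It assumes M is the number of image rows and N the number of columns, flips the mask to get the "reversed" version mentioned above, and only visits positions where the window fits, so the exact count is w^2 x (M - w + 1) x (N - w + 1). The function name `convolve2d_valid` is hypothetical.

```python
def convolve2d_valid(image, mask):
    """Direct 2D convolution over fully covered positions.

    Returns the output image and the number of multiply-accumulate steps.
    """
    w = len(mask)
    # Convolution uses the mask reversed in both directions.
    flipped = [row[::-1] for row in mask[::-1]]
    macs = 0
    out = []
    for r in range(len(image) - w + 1):
        out_row = []
        for c in range(len(image[0]) - w + 1):
            acc = 0
            for i in range(w):
                for j in range(w):
                    acc += image[r + i][c + j] * flipped[i][j]
                    macs += 1
            out_row.append(acc)
        out.append(out_row)
    return out, macs


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
mask = [[1, 0],
        [0, 0]]
result, mac_count = convolve2d_valid(image, mask)
```

Because the mask is flipped, the single 1 in the top-left corner selects the bottom-right pixel of each window.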

11 Data Transfer Rate Must transfer data between image acquisition module, memory, and processor. Input Data Transfer Rate and Output Data Transfer Rate [equations not preserved in transcript] –b = # of bits per pixel –f_F = processing rate in images per second. Requires efficient use of communication bandwidth and parallel processing.

12 Implementation Technology: FPGA Provides massively parallel structures and high density for logic and arithmetic. Tasks are implemented spatially rather than temporally. Possible to control at bit level to build specialized data paths. Offers more raw computational power than conventional processors. Shorter design cycles than ASICs. Well suited for implementing parallel architectures.

13 Memory Accesses Gap between processor speed and memory access speed –Memory access overhead critical issue Window-based operations are memory intensive; require new pixel in each step High potential for parallelism since independent operations are applied to large regions of image arrays

14 Memory Accesses Pixels might not be stored as neighboring elements –Parallelism is hidden Windows usually overlap with neighboring windows Must create vectors of data elements and process them using parallel vectorization techniques.

15 Overlapping Windows Three windows shown; shaded box indicates overlapping data.

16 Overlapping Windows Some pixels can be used in computation of all three windows Reduce memory accesses for those pixels by a factor of 3 Large number of windows means less overlap Must compromise between data overlap and window count

17 Data Parallelism Can be combined with loop unrolling to reduce the number of sequential memory accesses. Process one window, then slide to the right and process the next. Unroll this loop so more windows are computed in parallel. Authors use vertical unrolling –Horizontal unrolling can be applied equally

18 Data Parallelism Number of pixels read per column is directly dependent on number of rows processed in parallel. Number of pixels read = w + N_R - 1 –w = window mask length/width –N_R = rows processed in parallel. Number of Memory Accesses (M x N image): [equation not preserved in transcript]
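
The effect of sharing a column of w + N_R - 1 pixels among N_R windows can be estimated with a short count. The memory-access equation from the slide is not available here, so the formulas below are this sketch's own assumptions: M is taken as the number of image rows, N as the number of columns, windows are placed only where they fit, and the last strip is charged as a full strip (an upper bound). The function names are hypothetical.

```python
import math


def reads_without_reuse(M, N, w):
    """Every window fetches all of its w * w pixels from memory."""
    return (M - w + 1) * (N - w + 1) * w * w


def reads_with_vertical_unrolling(M, N, w, n_rows):
    """n_rows vertically adjacent windows share each fetched column.

    Each strip reads w + n_rows - 1 pixels per image column.
    """
    strips = math.ceil((M - w + 1) / n_rows)
    return strips * N * (w + n_rows - 1)


naive = reads_without_reuse(512, 512, 7)
one_row = reads_with_vertical_unrolling(512, 512, 7, 1)
seven_rows = reads_with_vertical_unrolling(512, 512, 7, 7)
```

Under these assumptions, processing more rows in parallel lowers the read count, at the cost of more processing elements.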

19 Data Parallelism

20 Systolic Architecture Configurable Window Processor (CWP) –Processing element in systolic arch. Architecture reads data from input memory –P = image pixel –W = window mask coefficients Transmitted to array of processing elements for computation

21 Array of CWPs LDC = Local data collector –Collects results of CWPs CWP –Compute a window operator on same column of input image D = Delay line / shift register –Used for synchronization purposes

22 Architecture Flow Pixel is broadcast to all CWPs At each clock cycle: – Each CWP receives a different window coefficient –New image pixel for all processing elements Each CWP multiplies and accumulates values until all pixels in a window are processed After short latency, the LDC will collect the data and send it to output memory
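
The broadcast-and-delay flow can be mimicked in software with a simplified one-dimensional model. This is a sketch under stated assumptions, not the authors' VHDL: it handles a single image column, CWP k sees the coefficient stream delayed by k cycles, and the name `systolic_column` is hypothetical.

```python
def systolic_column(pixels, coeffs, n_cwp):
    """Simulate n_cwp multiply-accumulate elements fed by one pixel stream.

    At clock cycle t the pixel pixels[t] is broadcast to every CWP.
    CWP k receives coefficient coeffs[t - k], i.e. the coefficient stream
    delayed by k cycles, and accumulates the product.
    """
    w = len(coeffs)
    needed = w + n_cwp - 1          # pixels read per column
    if len(pixels) < needed:
        raise ValueError("not enough pixels for the requested CWP count")
    acc = [0] * n_cwp
    for t in range(needed):         # one broadcast pixel per clock cycle
        p = pixels[t]
        for k in range(n_cwp):
            i = t - k               # coefficient index seen by CWP k
            if 0 <= i < w:
                acc[k] += coeffs[i] * p
    return acc


pixels = [1, 2, 3, 4, 5]
coeffs = [1, 0, -1]
results = systolic_column(pixels, coeffs, 3)

# Reference: each CWP k should hold the window sum starting at pixel k.
expected = [sum(coeffs[i] * pixels[k + i] for i in range(3)) for k in range(3)]
```

In this model CWP k starts k cycles after CWP 0, so the last result is ready n_cwp - 1 cycles after the first.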

23 CWP AP – Arithmetic Processor (ALU) –Multiplies LRM – Local Reduction Module –Accumulator Pc – Result of window operation Wd – delayed window coefficient

24 Systolic Architecture

25 Processing Time Latency –Time required to start pipeline operation –Measured between activation of first CWP to last CWP Parallel processing time –Time when all CWPs are working in parallel –Addition of all times to process set of rows Performance compromised with number of rows processed –Directly reflects silicon resources allocated to architecture

26 Throughput Number of elemental operations system can perform per second Only scalar function and local reduction function are considered

27 Implementation Fully parameterizable VHDL description –Use generics to make design flexible Structural description used only elementary logic operations Design is platform, version, technology, and tool independent Used XCV2000E-6 VirtexE FPGA w/ 2 Million Gates

28 FPGA Technical Data

29 Performance Results I/O time not considered in results 512x512 Image w/ 7x7 Window Mask

30 Performance Results Image processing time for 7x7 window mask is 8.35 ms. Leaves enough time for image acquisition. 30 ms required for real-time constraints. Post-processing also possible.
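
The numbers on this slide can be turned into a rough budget. The sketch below uses only the quoted figures (512x512 image, 7x7 mask, 8.35 ms, 30 ms); the assumptions that one output is produced per image pixel and that each window element costs one scalar operation plus one reduction operation are made for illustration and are not taken from the paper.

```python
IMAGE_ROWS = 512
IMAGE_COLS = 512
WINDOW = 7
FRAME_TIME_S = 8.35e-3      # processing time quoted on the slide
REAL_TIME_BUDGET_S = 30e-3  # real-time constraint quoted on the slide

# Assumption: one window result per image pixel.
window_elements = IMAGE_ROWS * IMAGE_COLS * WINDOW * WINDOW

# Assumption: one scalar op and one reduction op per window element.
elemental_ops = 2 * window_elements

frames_per_second = 1.0 / FRAME_TIME_S
ops_per_second = elemental_ops / FRAME_TIME_S
slack_ms = (REAL_TIME_BUDGET_S - FRAME_TIME_S) * 1e3

print(f"window elements per frame: {window_elements}")
print(f"frames per second:         {frames_per_second:.1f}")
print(f"elemental ops per second:  {ops_per_second:.3e}")
print(f"slack per frame (ms):      {slack_ms:.2f}")
```

The slack is the time left within each 30 ms period for acquisition and post-processing.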

31 Performance Results Throughput increases with number of processing elements Utilization and activity efficiency of processing elements decrease

32 Improving Performance Optimize design mapped on the FPGA. Apply timing constraints for increased speed. Use a better FPGA. Note that the performance requirement for real-time operation is still met with a lower-end FPGA.

33 Comparisons to Other Architectures

34 Area/Performance Tradeoffs Low resource utilization allows implementation in compact mobile apps High computational density due to small area usage Can reduce hardware or clock frequency –Reduces power –Still meets timing requirements

35 Reconfigurability Flexible enough to support different window-based image operators. Allows different image-based applications on a SoC.

36 Conclusion Easy to exploit SIMD for parallelism in image processing FPGAs allow reconfigurability and flexibility Real-time constraints can be met with high performance and low area usage All Images and Graphs from: Torres-Huitzil, Cesar, and Miguel Arias-Estrada. "FPGA-Based Configurable Systolic Architecture for Window-Based Image Processing." EURASIP Journal on Applied Signal Processing 7(2005): 1024-1034.

37 Hardware, Design and Implementation Issues on a FPGA-Based Smart Camera Fabio Dias, Francois Berry, Jocelyn Serot, Francois Marmoiton

38 Summary of Paper Describe the hardware architecture of an FPGA-based Smart Camera research platform and some of the hardware design issues. Propose an architectural design methodology based on pre-programmed processing elements. Provide a low-level image processing example. Present an embedded tracking application to show the camera’s utilization.

39 What is a Smart Camera? Smart cameras utilize embedded processing to relieve some of the low-level computational burden of the interfacing system. Reduce communication flow and overhead. Processing resources consist of FPGA devices, media/streaming processors, DSPs, etc.

40 Why FPGA devices? Reconfigurability –Allows the camera to adapt to a wide range of applications. Parallelism –Takes advantage of the independence of many computational tasks in order to meet time constraints. Hardware Flexibility –Capable of interfacing with a wide range of external devices such as memory or ASICs.

41 Smart Camera Hardware Architecture ALTERA Stratix EP1S60F1020C7 4Mpixels LUPA-400 image sensor (2) 2d accelerometers (3) gyroscopes 10Mb SRAM 64Mb SDRAM

42 Smart Camera Hardware Architecture

43 Design Methodology Centralized around reconfiguration of the FPGA. –Set of Pre-designed configurable data processing elements (PE’s). –Programmable Control Module System supervisor, communicating with the PE’s through registers and hand-shake signals Configures and synchronizes different PE’s

44 Design Methodology Schematic of a SoPC architecture illustrating the proposed methodological approach.

45 Generic Window-Based Processing Element Applied over a small defined portion of the input image. Deals with large amounts of data because window operations are often applied over the entire image. Examples –Convolution –Correlation estimation –Morphological transformations

46 Generic Window-Based Processing Element

47 Smart Camera Application Template Tracking System VGA images sent to host computer to be displayed. The user selects frame of interest for tracking. A search window is acquired and stored into memory. A sliding window SAD algorithm is applied. The portion with the best correlation score is considered to be the new template location. A null acceleration model is employed in order to predict displacement in the next frame.
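
The tracking loop can be sketched in software. This is a minimal illustration assuming that "best correlation score" means the lowest sum of absolute differences (SAD) and that the null acceleration model amounts to repeating the last observed displacement; the function names and the toy image are hypothetical and are not taken from the camera's implementation.

```python
def sad(image, template, top, left):
    """Sum of absolute differences between the template and an image patch."""
    return sum(abs(image[top + i][left + j] - template[i][j])
               for i in range(len(template))
               for j in range(len(template[0])))


def best_match(image, template):
    """Slide the template over the search window; lowest SAD wins."""
    th, tw = len(template), len(template[0])
    candidates = [(sad(image, template, r, c), r, c)
                  for r in range(len(image) - th + 1)
                  for c in range(len(image[0]) - tw + 1)]
    score, r, c = min(candidates)
    return (r, c), score


def predict_next(prev_pos, curr_pos):
    """Null acceleration: assume the last displacement repeats."""
    return (2 * curr_pos[0] - prev_pos[0], 2 * curr_pos[1] - prev_pos[1])


search_window = [[0, 0, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 9, 8, 0, 0],
                 [0, 7, 6, 0, 0],
                 [0, 0, 0, 0, 0]]
template = [[9, 8],
            [7, 6]]

position, score = best_match(search_window, template)
predicted = predict_next((1, 0), position)
```

In a real tracker the predicted position would be used to centre the next search window, which keeps the search area small.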

48 Smart Camera Application Embedded tracking implemented architecture

49 Experimental Results

50 Conclusion Generic window-based processing element successfully implemented in an FPGA. An image tracking algorithm utilizing the described design methodology successfully implemented with adequate performance. A flexible FPGA-based smart camera research platform created for future research. All Images and Graphs from: Dias, Fabio, Francois Berry, Jocelyn Serot, and Francois Marmoiton, "HARDWARE, DESIGN AND IMPLEMENTATION ISSUES ON A FPGA-BASED SMART CAMERA." IEEE 1-4244-1354-0/07(2007): 20-26.

